Apache Calcite
One Front-end to Rule Them All
Michael Mior, PMC Chair
Overview
● What is Apache Calcite?
● Calcite components
● Streaming SQL
● Next steps and contributing to Calcite
What is Apache Calcite?
● An ANSI-compliant SQL parser
● A logical query optimizer
● A heterogeneous data processing framework
Origins
2004 LucidEra and SQLstream were each building SQL systems
2012 Code pared down and entered the ASF incubator
2015 Graduated from incubator
2016 I joined the Calcite project as a committer
2017 Joined the PMC and was voted as chair
2018 Paper presented at SIGMOD
Powered by Calcite
● Many open source projects
(Apache Hive, Apache Drill, Apache Phoenix, Lingual, …)
● Commercial products
(MapD, Dremio, Qubole, …)
● Contributors from Huawei, Uber, Intel, Salesforce, …
Conventional Architecture
[Diagram: JDBC Client, JDBC Server, SQL Parser, Optimizer, Operators, Metadata, Datastore]
Calcite Architecture
[Diagram: JDBC Client, JDBC Server, SQL Parser, Optimizer, Pluggable Metadata, Pluggable Rules, Adapters, 3rd party data]
Avatica
Optimizer
● Operates on relational algebra by matching rules
● Calcite contains 100+ rewrite rules
● Currently working on validating these using Cosette
● Optimization is cost-based
● “Calling convention” allows optimization across backends
Example rules
● Join order transposition
● Transpose different operators (e.g. project before join)
● Merge adjacent operators
● Materialized view query rewriting
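As a rough illustration of what "merge adjacent operators" means, here is a minimal Python sketch of a single rewrite rule applied to a toy plan. The Scan and Filter classes, the helper names, and the "song_order > 3" predicate are hypothetical and only mimic the idea; this is not Calcite's API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    conditions: Tuple[str, ...]  # conjunction of predicate strings
    child: object


def merge_filters(node):
    """Rule: Filter(Filter(x)) becomes one Filter(x) with both condition lists.

    Returns None when the pattern does not match.
    """
    if isinstance(node, Filter) and isinstance(node.child, Filter):
        inner = node.child
        return Filter(inner.conditions + node.conditions, inner.child)
    return None


def apply_to_fixpoint(node, rule):
    """Rewrite children first, then apply the rule here until it stops matching."""
    if isinstance(node, Filter):
        node = Filter(node.conditions, apply_to_fixpoint(node.child, rule))
    while (rewritten := rule(node)) is not None:
        node = rewritten
    return node


# Two stacked filters over one scan (hypothetical predicates).
plan = Filter(("artist = 'Relient K'",),
              Filter(("song_order > 3",), Scan("playlists")))
merged = apply_to_fixpoint(plan, merge_filters)
print(merged)
```

A real optimizer would register many such rules and let a planner decide which matches to fire; the sketch only shows the match-and-replace step for one rule.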
Optimizer
● Based on the Volcano optimizer generator
○ Logical operators are functions (e.g. join)
○ Physical operators implement logical operators
○ Physical properties are attributes of the data
(e.g. sorting, partitioning)
● Start with logical expressions and physical properties
● Optimization produces a plan with only physical operators
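To make the interplay of logical operators, physical operators, and physical properties more concrete, here is a toy Python sketch that picks the cheapest physical join for one logical join. The operator names and cost formulas are invented for illustration and are not the ones Calcite uses; the only point is that a physical property (whether inputs are already sorted) changes which implementation wins.

```python
import math


def sort_cost(rows):
    """Assumed cost of enforcing a sort order on an unsorted input."""
    return rows * math.log2(rows) if rows > 1 else 0.0


def candidates(left_rows, right_rows, left_sorted, right_sorted):
    """Return hypothetical costs for each physical implementation of a join."""
    merge = float(left_rows + right_rows)
    # A merge join needs sorted inputs; pay for a sort where the property is missing.
    if not left_sorted:
        merge += sort_cost(left_rows)
    if not right_sorted:
        merge += sort_cost(right_rows)
    return {
        "NestedLoopJoin": float(left_rows * right_rows),
        "HashJoin": 1.5 * (left_rows + right_rows),
        "MergeJoin": merge,
    }


def cheapest(left_rows, right_rows, left_sorted, right_sorted):
    """Pick the physical operator with the lowest estimated cost."""
    costs = candidates(left_rows, right_rows, left_sorted, right_sorted)
    return min(costs, key=costs.get)


print(cheapest(1000, 1000, True, True))
print(cheapest(1000, 1000, False, False))
```

With both inputs already sorted the merge join is cheapest under these assumed formulas; without that property the cost of sorting tips the choice to the hash join.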
Materialized views
Performance
Relational Algebra and Streaming
● Scan
● Filter
● Project
● Join
● Sort
● Aggregate
● Union
● Values
● Delta (relation to stream)
● Chi (stream to relation)
Adapters
● Connect to different backends (not just relational)
● Only required operation is a table scan
● Allow push down of filter, sort, etc.
● Calcite implements remaining operators
● Calling convention allows Calcite to separate backend-specific operators and generic implementations
● Any relational algebra operator can be pushed down
● Operator push down simply requires a new optimizer rule
Conventions
1. Plans start as logical nodes
2. Assign each Scan its table’s native convention
3. Fire rules to propagate conventions to other nodes
4. The best plan may use an engine not tied to any native format
[Diagram: plan trees of Join, Filter, and Scan nodes illustrating the steps]
Conventions
● Conventions are a uniform representation of hybrid queries
● Physical property of nodes (like ordering, distribution)
● Adapter = Schema factory + Convention + Rules to convert to a convention
[Diagram: plan tree of Join, Filter, and Scan nodes]
Apache Cassandra Adapter
● Wide column store database
● Tables are partitioned across servers, with rows clustered (sorted) within each partition
● Supports limited filtering and sorting
Query example
CREATE TABLE playlists (id uuid, song_order int,
song_id uuid, title text, artist text,
PRIMARY KEY (id, song_order));
SELECT title FROM playlists WHERE
id=62c36092-82a1-3a00-93d1-46196ee77204 AND
artist='Relient K' ORDER BY song_order;
Query example
SELECT * FROM playlists;
[Plan diagram: Sort, Scan, Project, Filter]
● Start with a table scan
● Remaining operations performed by Calcite
Query example
SELECT * FROM playlists WHERE
id=62c36092-82a1-3a00-93d1-46196ee77204;
[Plan diagram: Sort, Scan, Project, Filter, Filter]
● Push the filter on the partition key to Cassandra
● The remaining filter is performed by Calcite
Query example
SELECT * FROM playlists WHERE
id=62c36092-82a1-3a00-93d1-46196ee77204
ORDER BY song_order;
[Plan diagram: Filter, Scan, Project, Filter, Sort]
● Push the ordering to Cassandra
● This uses the table’s clustering key
Query example
SELECT title, artist FROM playlists WHERE
id=62c36092-82a1-3a00-93d1-46196ee77204
ORDER BY song_order;
[Plan diagram: Scan, Filter, Filter, Sort, Project, Project]
● Push down the project of necessary fields
● This is the query sent to Cassandra
● Only the remaining filter (on artist) and the final project are done by Calcite
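The push-down steps above can be sketched as a small Python function that splits work between Cassandra and Calcite. This is a simplified illustration, not the adapter's actual code: the function names and the rule that ordering is pushed only when the partition key is restricted and the sort column is the first clustering key are assumptions made for the sketch.

```python
PARTITION_KEYS = {"id"}          # from PRIMARY KEY (id, song_order)
CLUSTERING_KEYS = ["song_order"]


def split_filters(equality_filters):
    """Separate filters Cassandra can evaluate from those left to Calcite."""
    pushed = {c: v for c, v in equality_filters.items() if c in PARTITION_KEYS}
    residual = {c: v for c, v in equality_filters.items() if c not in PARTITION_KEYS}
    return pushed, residual


def build_cql(table, columns, equality_filters, order_by=None):
    """Return (CQL text, residual filters, whether the sort was pushed down)."""
    pushed, residual = split_filters(equality_filters)
    # Cassandra must also return columns that the residual filters still need.
    needed = list(columns) + [c for c in residual if c not in columns]
    cql = f"SELECT {', '.join(needed)} FROM {table}"
    if pushed:
        cql += " WHERE " + " AND ".join(f"{c}={v}" for c, v in pushed.items())
    sort_pushed = bool(pushed) and order_by == CLUSTERING_KEYS[0]
    if sort_pushed:
        cql += f" ORDER BY {order_by}"
    return cql, residual, sort_pushed


cql, residual, sort_pushed = build_cql(
    "playlists",
    ["title"],
    {"id": "62c36092-82a1-3a00-93d1-46196ee77204", "artist": "'Relient K'"},
    order_by="song_order",
)
print(cql)
print(residual)
```

In this sketch the filter on artist stays in `residual`, so it and the final projection down to title would be evaluated outside Cassandra.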
What we get for free
● Materialized view maintenance
● View-based query rewriting
● Full SQL support
● Join with other data sources
Data representation
● All data must be modeled as relations
● Easy for relational databases
● Also relatively easy for many wide column stores
● What about document stores?
Semistructured Data
● Columns can have complex types (e.g. arrays and maps)
● Add UNNEST operator to relational algebra
● New rules can be added to optimize these queries
Before UNNEST:
name   age  pets
Sally  29   [{name: Fido, type: Dog}, {name: Jack, type: Cat}]
After UNNEST:
name   age  pets
Sally  29   {name: Fido, type: Dog}
Sally  29   {name: Jack, type: Cat}
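The effect of UNNEST on the Sally row can be sketched in a few lines of Python. The `unnest` helper is a hypothetical stand-in for the relational operator, written only to show how one row with an array column becomes one row per array element.

```python
def unnest(rows, column):
    """Emit one output row per element of the array stored in `column`."""
    out = []
    for row in rows:
        for element in row[column]:
            new_row = dict(row)        # copy the other columns unchanged
            new_row[column] = element  # replace the array with a single element
            out.append(new_row)
    return out


people = [
    {"name": "Sally", "age": 29,
     "pets": [{"name": "Fido", "type": "Dog"}, {"name": "Jack", "type": "Cat"}]},
]
flat = unnest(people, "pets")
for row in flat:
    print(row)
```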
MongoDB Adapter
_MAP
{ _id : 02401, city : BROCKTON, loc : [ -71.03434799999999, 42.081571 ], pop : 59498, state : MA }
{ _id : 06902, city : STAMFORD, loc : [ -73.53742800000001, 41.052552 ], pop : 54605, state : CT }
● Use one column with the whole document
● Unnest attributes as needed
● This is very messy, but we have no schema to work with
MongoDB Adapter
id     city      longitude   latitude   population  state
02401  BROCKTON  -71.034348  42.081571  59498       MA
06902  STAMFORD  -73.537428  41.052552  54605       CT
● Views to the rescue!
● Users of adapters can define structured views over
semistructured data (or do this lazily! See Apache Drill)
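A structured view over the single _MAP column amounts to picking fields out of each document and giving them names and positions. The Python sketch below mimics that with plain dictionaries; the `zips_view` name is hypothetical, and it assumes `loc` holds [longitude, latitude], which is what the coordinate values suggest.

```python
documents = [
    {"_id": "02401", "city": "BROCKTON",
     "loc": [-71.03434799999999, 42.081571], "pop": 59498, "state": "MA"},
    {"_id": "06902", "city": "STAMFORD",
     "loc": [-73.53742800000001, 41.052552], "pop": 54605, "state": "CT"},
]


def zips_view(docs):
    """Project each schemaless document onto a fixed set of named columns."""
    for doc in docs:
        yield {
            "id": doc["_id"],
            "city": doc["city"],
            "longitude": doc["loc"][0],  # assumed order: [longitude, latitude]
            "latitude": doc["loc"][1],
            "population": doc["pop"],
            "state": doc["state"],
        }


rows = list(zips_view(documents))
for row in rows:
    print(row)
```

Because the view is a generator, rows are produced only when a consumer asks for them, which loosely mirrors the "do this lazily" remark.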
Available Adapters
SELECT p.productName, COUNT(*) AS c
FROM splunk.splunk AS s
JOIN mysql.products AS p
ON s.productId = p.productId
WHERE s.action = 'purchase'
GROUP BY p.productName
ORDER BY c DESC
Plan before the FilterIntoJoin rule fires:
sort (Key: c desc)
  group (Key: productName, Agg: count)
    filter (Condition: action = 'purchase')
      join (Key: productId)
        scan (Table: splunk) [Splunk]
        scan (Table: products) [MySQL]
Plan after the FilterIntoJoin rule fires:
sort (Key: c desc)
  group (Key: productName, Agg: count)
    join (Key: productId)
      filter (Condition: action = 'purchase') [Splunk]
        scan (Table: splunk) [Splunk]
      scan (Table: products) [MySQL]
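The FilterIntoJoin rewrite shown in these two plans can be sketched as a pattern match on a toy plan tree. The Python classes below are hypothetical stand-ins for relational operators, and the rule is simplified: it assumes an inner join and a filter that references columns from only one input, identified here by a table alias.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str
    alias: str


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    key: str


@dataclass(frozen=True)
class Filter:
    alias: str       # alias of the only table the condition refers to
    condition: str
    child: object


def filter_into_join(node):
    """Push a Filter sitting on top of a Join down onto the matching input."""
    if isinstance(node, Filter) and isinstance(node.child, Join):
        join = node.child
        if isinstance(join.left, Scan) and join.left.alias == node.alias:
            return Join(Filter(node.alias, node.condition, join.left),
                        join.right, join.key)
        if isinstance(join.right, Scan) and join.right.alias == node.alias:
            return Join(join.left,
                        Filter(node.alias, node.condition, join.right), join.key)
    return node  # pattern did not match; leave the plan unchanged


before = Filter("s", "action = 'purchase'",
                Join(Scan("splunk", "s"), Scan("products", "p"), "productId"))
after = filter_into_join(before)
print(after)
```

Once the filter sits directly above the splunk scan, an adapter rule can hand both to Splunk, so fewer rows reach the join.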
Streaming Data
● Calcite supports multiple windowing algorithms
(e.g. tumbling, sliding, hopping)
● Streaming queries can be combined with tables
● Streaming queries can be optimized using the same rules
along with new rules specifically for streaming queries
Streaming Data
● Relations can be used both as streams and tables
● Calcite is a reference implementation for streaming SQL
(still being standardized)
SELECT STREAM * FROM Orders AS o WHERE units >
(SELECT AVG(units) FROM Orders AS h WHERE
h.productId = o.productId AND h.rowtime >
o.rowtime - INTERVAL '1' YEAR)
Windowing
Tumbling window
SELECT STREAM … FROM Orders
GROUP BY FLOOR(rowtime TO HOUR)
SELECT STREAM … FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR)
Hopping window
SELECT STREAM … FROM Orders
GROUP BY HOP(rowtime, INTERVAL '1' HOUR,
INTERVAL '2' HOUR)
Session window
SELECT STREAM … FROM Orders
GROUP BY SESSION(rowtime, INTERVAL '1' HOUR)
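To show how the three window types assign rows to groups, here is a small Python sketch. It is not how Calcite evaluates these functions; the helper names are made up, windows are aligned to the Unix epoch by assumption, the HOP arguments are read as (slide, size), and a session is assumed to continue while the gap between consecutive rows is at most the given interval.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumble(ts, size):
    """Start of the single tumbling window that contains ts."""
    return ts - ((ts - EPOCH) % size)


def hop(ts, slide, size):
    """Starts of all hopping windows (one every `slide`, each `size` long) containing ts."""
    start = tumble(ts, slide)
    starts = []
    while start > ts - size:
        starts.append(start)
        start -= slide
    return sorted(starts)


def sessions(timestamps, gap):
    """Group timestamps into sessions; a new session starts after a pause longer than gap."""
    groups = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][-1] <= gap:
            groups[-1].append(ts)
        else:
            groups.append([ts])
    return groups


hour = timedelta(hours=1)
ts = datetime(2018, 1, 1, 10, 30)
print(tumble(ts, hour))        # one window per row
print(hop(ts, hour, 2 * hour)) # a row can fall into several windows
print(sessions([datetime(2018, 1, 1, 10, 0),
                datetime(2018, 1, 1, 10, 20),
                datetime(2018, 1, 1, 12, 0)], hour))
```

The main contrast is that a row belongs to exactly one tumbling window, possibly several hopping windows, and a session window whose bounds depend on the data itself.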
My Use Case
● Perform view-based query rewriting to provide a logical
model over a denormalized data store
● Denormalized tables are views over (non-materialized)
logical tables
● Queries can be rewritten from logical tables to the most
cost-efficient choice of materialized views
Use Cases
● Parsing and validating SQL (not so easy)
● Adding a relational front end to an existing system
● Prototyping new query processing algorithms
● Integrating data from multiple backends
● Allowing RDBMS tools to work with non-relational DBs
Calcite Project Future Work
● Geospatial queries
● Processing scientific data formats
● Sharing data in-memory between backends
● Additional query execution engines
My Future Work
● Better cost modeling
● Query-based data source selection
● Cost-based database system selection
Contributing to Apache Calcite
● Pick an existing issue or file a new one and start coding!
● Mailing list is generally very active
● New committers and PMC members regularly added
● Many opportunities for projects at various scales
Additional areas for contribution
● Testing (SQL is hard!)
● Incorporating state-of-the-art in DB research
● Access control across multiple systems
● Adapters for new classes of database (e.g. array DBs)
● Implement missing SQL features (e.g. set operations)
…
Thanks to
● Edmon Begoli, Oak Ridge National Laboratory
● Jesús Camacho-Rodríguez, Hortonworks
● Julian Hyde, Hortonworks
● Daniel Lemire, Université du Québec (TÉLUQ)
● All other Calcite contributors!
Questions?
https://calcite.apache.org/
