Querying Druid in SQL with Superset

Druid is a high-performance, column-oriented, distributed data store that is widely used at Oath for big data analysis. Druid uses a JSON schema as its query language, making it difficult for new users unfamiliar with the schema to start querying Druid quickly. The JSON schema is designed to work with Druid's data ingestion methods, so it can provide high-performance features such as data aggregations in JSON, but many users are unable to take advantage of these features because they are not familiar with the specifics of how to optimize Druid queries. However, most new Druid users at Yahoo are already very familiar with SQL, and the queries they want to write for Druid can be expressed as concise SQL.
We found that our data analysts wanted an easy way to issue ad-hoc Druid queries and view the results in a BI tool in a way that's presentable to nontechnical stakeholders. In order to achieve this, we had to bridge the gap between Druid, SQL, and our BI tools such as Apache Superset. In this talk, we explore different ways to query a Druid datasource in SQL and discuss which methods were most appropriate for our use cases. We also discuss our open-source contributions so that others can make use of our work.

GURUGANESH KOTTA, Software Dev Eng, Oath, and JUNXIAN WU, Software Engineer, Oath Inc.

Querying Druid in SQL with Superset

  1. Druid SQL Interface (Calcite)
  2. Problem
     - Druid is used extensively on our team and at Oath
     - Druid is hard to interact with due to its JSON input format
     - Many at Oath are not familiar with how to optimize Druid queries
  3. Why use Druid?
     - Able to ingest and serve data in real time with low latency
     - Good for ad-hoc queries
     - Good for storing aggregate data
     - Scalable to ingest millions of events/sec
  4. Using SQL to bridge the gap
     ● SQL is the lingua franca of data
     ● Most at Oath are already familiar with SQL
     ● SQL is easier to write and more concise than JSON
     ● All BI tools we use support SQL
  5. SQL vs Druid JSON
     Here is a sample SQL query for a given dataset:

     SELECT SUM("store_sales") FILTER (WHERE "store_state" = 'CA'),
            SUM("store_cost") FILTER (WHERE "store_state" = 'OR')
     FROM "foodmart"
     WHERE "the_month" <> 'October'
     LIMIT 10

     The same query in Druid JSON format is much less readable.
  6. SQL vs Druid JSON
     {
       "queryType": "groupBy",
       "dataSource": "foodmart",
       "granularity": "all",
       "dimensions": [],
       "limitSpec": { "type": "default", "limit": 10, "columns": [] },
       "filter": {
         "type": "and",
         "fields": [
           {
             "type": "or",
             "fields": [
               { "type": "selector", "dimension": "store_state", "value": "CA" },
               { "type": "selector", "dimension": "store_state", "value": "OR" }
             ]
           },
           {
             "type": "not",
             "field": { "type": "selector", "dimension": "the_month", "value": "October" }
           }
         ]
       },
       "aggregations": [
         {
           "type": "filtered",
           "filter": { "type": "selector", "dimension": "store_state", "value": "CA" },
           "aggregator": { "type": "doubleSum", "name": "EXPR$0", "fieldName": "store_sales" }
         },
         {
           "type": "filtered",
           "filter": { "type": "selector", "dimension": "store_state", "value": "OR" },
           "aggregator": { "type": "doubleSum", "name": "EXPR$1", "fieldName": "store_cost" }
         }
       ],
       "intervals": ["1900-01-09T00:00:00.000/2992-01-10T00:00:00.000"]
     }
  7. Pre-existing Solutions
     - Druid SQL services
     - Hive Druid connection
     - Apache Calcite
  8. Druid SQL Services
     - Druid has SQL support via Apache Calcite
     - Pros:
       - Significantly simplifies query JSON
       - Already supported in Druid
     - Cons:
       - Support is experimental
       - Doesn't support DataSketch aggregators

     curl -XPOST -H 'Content-Type: application/json' http://BROKER:8082/druid/v2/sql/ -d @query.json

     where query.json contains:
     {
       "query": "SELECT COUNT(*) FROM data_source WHERE foo = 'bar' AND __time > TIMESTAMP '2000-01-01 00:00:00'",
       "context": {"sqlTimeZone": "America/Los_Angeles"}
     }
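     For reference, a minimal Python sketch of the same call using the requests library; the broker host/port and datasource name are placeholders, as in the curl example above:

         import requests

         # POST a SQL statement to Druid's SQL endpoint on the broker.
         payload = {
             "query": "SELECT COUNT(*) FROM data_source "
                      "WHERE foo = 'bar' AND __time > TIMESTAMP '2000-01-01 00:00:00'",
             "context": {"sqlTimeZone": "America/Los_Angeles"},
         }
         resp = requests.post("http://BROKER:8082/druid/v2/sql/", json=payload)
         resp.raise_for_status()
         print(resp.json())  # result rows as a list of JSON objects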
  9. Hive Druid Connection
     - Hive also has some level of Druid support via Apache Calcite
     - Pros:
       - Many BI tools already support Hive
     - Cons:
       - Lacks support for sketches
  10. Apache Calcite
      - Translator between SQL and Druid JSON
      - Industry-standard SQL parser
      - Represents your query in relational algebra, transforms it using planning rules, and optimizes it according to a cost model
      - Open source
  11. Our Solution
      - Use Apache Calcite directly
      - Address the deficiencies of Calcite and contribute back to the open-source community
  12. Calcite relational algebra
      - Relational logic tree translated from the SQL query
      - Each node has a cost based on its context
      - Example: SELECT SUM(a) AS c FROM table1 WHERE b = 1 ORDER BY c
        TableScan on table1 -> Filter (b = 1) -> Project (table1.a -> a) -> Aggregate (SUM(a)) -> Sort on c
  13. Query Planning
      - Apply rules to the relational logic tree
      - Transform certain logic subtrees into Druid query nodes
      (diagram: TableScan + Filter + Project + Aggregate become a Druid GroupBy query node followed by a Sort, or the whole tree becomes a single Druid TopN query node)
  14. Optimization
      - Use the cost model to estimate the performance of the different transformed logic trees
      - The basic idea is to push more of the computation into Druid
      (diagram: a Druid GroupBy query node (cost = 10) topped by a Sort (cost = 10), versus a single Druid TopN query node (cost = 15))
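      To make the cost-based selection concrete, here is a toy Python sketch; Calcite's actual planner is Java, and the node names and costs below simply restate the example on this slide:

          # Toy illustration of cost-based plan selection (not Calcite's API):
          # each candidate plan is a list of (node, cost) pairs; cheapest total wins.
          candidate_plans = {
              "groupby_then_sort": [("DruidGroupByQuery", 10), ("Sort", 10)],  # total 20
              "topn_only": [("DruidTopNQuery", 15)],                           # total 15
          }

          def plan_cost(plan):
              return sum(cost for _node, cost in plan)

          best = min(candidate_plans, key=lambda name: plan_cost(candidate_plans[name]))
          print(best)  # -> "topn_only": the Druid TopN query node is chosen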
  15. Renderer
      - Renders the Druid JSON query to be sent out
      - If any computation cannot be pushed into the JSON query, it runs locally in Calcite
      The Druid TopN query node renders as:
      { "queryType": "topN", "dataSource": "foodmart", "granularity": "all", …
  16. Major Problems
      - Did not support post-aggregations
        - e.g. the AVERAGE function; could run out of memory
      - Did not support filtered aggregations
        - Could cause Druid to return all rows for Calcite to process in memory
      - Did not support distinct-count aggregators using ThetaSketches
        - Calcite would always try to give the user exact results
        - Distinct-count aggregations were not pushed to Druid
  17. Post-Aggregation Support
      - New rule to merge the post-aggregation node into the Druid query node
      - New renderer that can generate a Druid query with post-aggregations
      (diagram: before the new rule, an extra Aggregate runs in Calcite on top of the Druid GroupBy query node; after it, the aggregate is merged into the Druid GroupBy query node itself)
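      As an illustration, an AVERAGE can now be pushed down as two simple aggregations plus an arithmetic post-aggregation. Below is a hand-written sketch (a Python dict) of the kind of Druid JSON the renderer can emit; the aggregator names are illustrative, and the column and datasource reuse the foodmart example:

          # Sketch: AVG("store_sales") computed inside Druid as SUM / COUNT
          # via a post-aggregation, instead of being computed in Calcite.
          avg_query = {
              "queryType": "groupBy",
              "dataSource": "foodmart",
              "granularity": "all",
              "dimensions": [],
              "aggregations": [
                  {"type": "doubleSum", "name": "sum_sales", "fieldName": "store_sales"},
                  {"type": "count", "name": "row_count"},
              ],
              "postAggregations": [
                  {"type": "arithmetic", "name": "avg_sales", "fn": "/",
                   "fields": [
                       {"type": "fieldAccess", "fieldName": "sum_sales"},
                       {"type": "fieldAccess", "fieldName": "row_count"},
                   ]},
              ],
              "intervals": ["1900-01-09T00:00:00.000/2992-01-10T00:00:00.000"],
          }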
  18. Filtered Aggregations Support
      - New rule to move the Filter operation from Calcite to Druid
      - Optimizations on filters:
        - New rule to extract common filters into the outer filter
        - New rule to combine filters with logical ORs into the outer filter
      (diagram: the Filter2 node sitting above the Aggregate in Calcite is merged into a filtered aggregation inside the Druid query)
  19. Performance
      - Avoids unnecessary row scans in Druid
      - Greatly reduces query runtime when filters are involved
  20. Why ThetaSketch
      - Sketches are a class of streaming, stochastic algorithms
      - Trade off accuracy for speed – orders of magnitude faster
      - Exact up to configurable thresholds and approximate after
      - Mathematically provable error bounds
      - Bounded in space
      - Set operations: union, intersect, difference
      Sketches logo from http://datasketches.github.io
  21. ThetaSketch Support
      - New rule to translate the distinct-count aggregator node into a ThetaSketch node
      - Allow users to configure whether approximate cardinality is allowed
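      For example, when approximate cardinality is allowed, a distinct count can be rendered as a thetaSketch aggregation in the Druid JSON. A hand-written sketch follows (as a Python dict); the dimension and sketch column names are illustrative:

          # Sketch: approximate distinct count via a thetaSketch aggregator
          # over a pre-built sketch column.
          distinct_query = {
              "queryType": "groupBy",
              "dataSource": "foodmart",
              "granularity": "all",
              "dimensions": ["store_state"],
              "aggregations": [
                  {"type": "thetaSketch", "name": "unique_customers",
                   "fieldName": "customer_id_sketch"},
              ],
              "intervals": ["1900-01-09T00:00:00.000/2992-01-10T00:00:00.000"],
          }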
  22. Performance
      - Reduced the running time of queries with distinct-count aggregators when cardinality estimation is allowed
      - Sketch columns can now be utilized
      - With post-aggregation support, more operations can be applied
  23. User Interface
      - Superset is commonly used with Druid
      - Superset's SQL Lab is popular for SQL-speaking databases
      From the Superset documentation: https://superset.incubator.apache.org/
  24. Superset Calcite Connection
      - Superset is a Python application
      - A standard Python DB-API (PEP 249) driver was created
      - SQL Lab can now run ad-hoc queries against Druid
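      A minimal sketch of what querying through such a DB-API driver looks like; the module name calcite_dbapi and the connect() parameters are hypothetical stand-ins for the driver described on this slide:

          from calcite_dbapi import connect  # hypothetical module name

          # Any PEP 249-compliant driver follows this same connect/cursor shape.
          conn = connect(host="calcite-host", port=8080)
          cursor = conn.cursor()
          cursor.execute(
              'SELECT "store_state", SUM("store_sales") '
              'FROM "foodmart" GROUP BY "store_state"'
          )
          for row in cursor.fetchall():
              print(row)
          conn.close()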
  25. (architecture diagram) The user issues a query in Superset's SQL Lab; Superset talks to Calcite over JDBC; Calcite performs query parsing, planning, and any internal computation, pushing work to Druid through its Druid adapter; the output flows back to the user.
  26. Questions

Editor's notes

  • Druid is an open-source data store designed for sub-second queries on real-time and historical data. It is usually used to query event data: it can ingest events in real time and allows flexible data exploration and aggregation. At Oath, many data scientists want to run ad-hoc queries on user event data, and Druid is often their first choice. We will now go over how to work with Druid for ad-hoc queries.
  • In this presentation, we will first go over the reasons we want a SQL interface on top of Druid, briefly introducing Druid and the existing solutions along the way. After that, we will focus on the improvements the SQL interface needed, including our contributions to the open-source project. Finally, we will show how we combined the SQL interface with a clean user interface so that a wider range of users can work with Druid.
  • Let's say I am a data scientist who wants to run this SQL query on Druid. The query asks for the sum of store_sales in California and the sum of store_cost in Oregon, excluding data from October. Since the data lives in Druid, we have to write a query that Druid understands: a JSON document sent in an HTTP request. So what does that JSON look like?
  • After some experience with Druid, one becomes familiar with this format. But for someone who already knows SQL, translating it into this format takes time spent learning the Druid documentation and the JSON syntax. It would be great to have a SQL interface for querying Druid data without losing Druid's performance. Let's explore some SQL interfaces that Druid can work with.
  • The first option is the SQL service provided by Druid itself. It works much like a native Druid query: we send an HTTP request with the SQL statement wrapped in a JSON object. This feature is backed by an open-source tool called Apache Calcite.
  • Clearly, Calcite is the core of any SQL interface on Druid, so to develop one it is necessary to learn about Calcite first. We will briefly introduce Calcite and how it works, then go over the contributions we made to open-source Calcite.
  • The most important concept in Calcite is relational algebra. Calcite translates a SQL statement into a tree whose nodes represent the query's logic. The example tree corresponds to the SQL statement on the slide. First, the TableScan pins the query to a particular table. The Filter holds the WHERE logic (the value of column b equals 1). The Project node is less obvious: in many SQL databases the actual column name may include a namespace, and the Project node handles that kind of translation. The Aggregate node holds the SUM function, and the Sort node corresponds to the ORDER BY clause.
  • Given the relational logic tree, we can transform it into another tree representing equivalent logic, and the transformed tree is then translated into a Druid query. Transformation rules let subtrees with equivalent logic be exchanged for one another. In this example, the first four nodes can be transformed into a Druid GroupBy query node through the rules we specified; along another transformation path, the whole tree can be transformed into a Druid TopN query node with equivalent logic. The question now is which tree to pick as the final result.
  • The final result is determined by Calcite's optimizer. Each node in Calcite has a cost quantifying the computation required to run its logic, so different output trees have different costs, and the tree with the minimum cost wins. Back to our example: the tree on top costs 10 + 10 = 20, while the other output tree has a single node costing 15. The optimizer therefore picks the Druid TopN query node as the final result.
  • Now the final job is to render the Druid JSON query from the logic in the Druid query node. The renderer is essentially a JSON writer that generates the query from the information held in the node.
  • The whole idea of Calcite is brilliant, but when we tried to use it, it still had missing pieces that hurt performance, so we decided to contribute enhancements back to the open-source repository. These are the two major problems we worked on. First, Calcite at the time did not support post-aggregations, so part of the computation for functions like AVERAGE was performed in Calcite instead of Druid; moving that computation to Druid saves memory and time when running queries. Second, filtered aggregations were not supported, so Calcite might ask Druid for all rows and apply the filter on the local machine, causing a huge performance fallback whenever filtered aggregations were involved.
  • To add post-aggregation support, we added new rules that merge the post-aggregation node into supported Druid query nodes, so the final tree includes the post-aggregation logic. A new renderer was also needed to render queries with post-aggregators. Now, whenever possible, post-aggregations are pushed into the Druid query node.
  • Much like the post-aggregation support, we added new rules to deal with the filter node sitting right above the aggregate node. The general idea is to move the inner filter into the outer filter.
