Querying Druid in SQL with Superset

Druid is a high-performance, column-oriented, distributed data store that is widely used at Oath for big data analysis. Druid uses a JSON schema as its query language, making it difficult for new users unfamiliar with the schema to start querying Druid quickly. The JSON schema is designed to work with Druid's data ingestion methods, so it can provide high-performance features such as data aggregations in JSON, but many users are unable to take advantage of these features because they are not familiar with the specifics of how to optimize Druid queries. However, most new Druid users at Yahoo are already very familiar with SQL, and the queries they want to write for Druid can be expressed as concise SQL.
We found that our data analysts wanted an easy way to issue ad-hoc Druid queries and view the results in a BI tool in a way that's presentable to nontechnical stakeholders. In order to achieve this, we had to bridge the gap between Druid, SQL, and our BI tools such as Apache Superset. In this talk, we explore different ways to query a Druid datasource in SQL and discuss which methods were most appropriate for our use cases. We also discuss our open-source contributions so that others can make use of our work.

GURUGANESH KOTTA, Software Dev Eng, Oath, and JUNXIAN WU, Software Engineer, Oath Inc.

Querying Druid in SQL with Superset

  1. Druid SQL Interface (Calcite)
  2. Problem
     - Druid is used extensively on our team and at Oath
     - Druid is hard to interact with due to its JSON input format
     - Many at Oath are not familiar with how to optimize Druid queries
  3. Why use Druid?
     - Able to ingest and serve data in real time with low latency
     - Good for ad-hoc queries
     - Good for storing aggregate data
     - Scalable to ingest millions of events/sec
  4. Using SQL to bridge the gap
     ● SQL is the lingua franca of data
     ● Most at Oath are already familiar with SQL
     ● SQL is easier to write and more concise than JSON
     ● All BI tools we use support SQL
  5. SQL vs Druid JSON
     Here is a sample SQL query for a given dataset:

     SELECT SUM("store_sales") FILTER (WHERE "store_state" = 'CA'),
            SUM("store_cost") FILTER (WHERE "store_state" = 'OR')
     FROM "foodmart"
     WHERE "the_month" <> 'October'
     LIMIT 10

     The same query in Druid JSON format is much less readable.
  6. SQL vs Druid JSON
     {
       "queryType": "groupBy",
       "dataSource": "foodmart",
       "granularity": "all",
       "dimensions": [],
       "limitSpec": { "type": "default", "limit": 10, "columns": [] },
       "filter": {
         "type": "and",
         "fields": [
           {
             "type": "or",
             "fields": [
               { "type": "selector", "dimension": "store_state", "value": "CA" },
               { "type": "selector", "dimension": "store_state", "value": "OR" }
             ]
           },
           {
             "type": "not",
             "field": { "type": "selector", "dimension": "the_month", "value": "October" }
           }
         ]
       },
       "aggregations": [
         {
           "type": "filtered",
           "filter": { "type": "selector", "dimension": "store_state", "value": "CA" },
           "aggregator": { "type": "doubleSum", "name": "EXPR$0", "fieldName": "store_sales" }
         },
         {
           "type": "filtered",
           "filter": { "type": "selector", "dimension": "store_state", "value": "OR" },
           "aggregator": { "type": "doubleSum", "name": "EXPR$1", "fieldName": "store_cost" }
         }
       ],
       "intervals": ["1900-01-09T00:00:00.000/2992-01-10T00:00:00.000"]
     }
  7. Pre-existing Solutions
     - Druid SQL services
     - Hive Druid connection
     - Apache Calcite
  8. Druid SQL Services
     - Druid has SQL support via Apache Calcite
     - Pros:
       - Significantly simplifies query JSON
       - Already supported in Druid
     - Cons:
       - Support is experimental
       - Doesn't support DataSketch aggregators

     curl -XPOST -H 'Content-Type: application/json' http://BROKER:8082/druid/v2/sql/ -d @query.json

     where query.json contains:
     {
       "query": "SELECT COUNT(*) FROM data_source WHERE foo = 'bar' AND __time > TIMESTAMP '2000-01-01 00:00:00'",
       "context": {"sqlTimeZone": "America/Los_Angeles"}
     }
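     For reference, a minimal Python sketch of the same call using the requests library; the broker host/port and datasource name are placeholders, as in the curl example above:

         import requests

         # POST a SQL statement to Druid's SQL endpoint on the broker.
         payload = {
             "query": "SELECT COUNT(*) FROM data_source "
                      "WHERE foo = 'bar' AND __time > TIMESTAMP '2000-01-01 00:00:00'",
             "context": {"sqlTimeZone": "America/Los_Angeles"},
         }
         resp = requests.post("http://BROKER:8082/druid/v2/sql/", json=payload)
         resp.raise_for_status()
         print(resp.json())  # result rows as a list of JSON objects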
  9. Hive Druid Connection
     - Hive also has some level of Druid support via Apache Calcite
     - Pros:
       - Many BI tools already support Hive
     - Cons:
       - Lacks support for sketches
  10. Apache Calcite
      - Translator between SQL and Druid JSON
      - Industry-standard SQL parser
      - Represents your query in relational algebra, transforms it using planning rules, and optimizes it according to a cost model
      - Open source
  11. Our Solution
      - Use Apache Calcite directly
      - Address the deficiencies of Calcite and contribute back to the open-source community
  12. Calcite relational algebra
      - Relational logic tree translated from the SQL query
      - Each node has a cost based on its context
      - Example: SELECT SUM(a) AS c FROM table1 WHERE b = 1 ORDER BY c
        TableScan on table1 -> Filter (b = 1) -> Project (table1.a -> a) -> Aggregate (SUM(a)) -> Sort on c
  13. Query Planning
      - Apply rules to the relational logic tree
      - Transform certain logic subtrees into Druid query nodes
      (diagram: TableScan + Filter + Project + Aggregate become a Druid GroupBy query node followed by a Sort, or the whole tree becomes a single Druid TopN query node)
  14. Optimization
      - Use the cost model to estimate the performance of the different transformed logic trees
      - The basic idea is to push more of the computation into Druid
      (diagram: a Druid GroupBy query node (cost = 10) topped by a Sort (cost = 10), versus a single Druid TopN query node (cost = 15))
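      To make the cost-based selection concrete, here is a toy Python sketch; Calcite's actual planner is Java, and the node names and costs below simply restate the example on this slide:

          # Toy illustration of cost-based plan selection (not Calcite's API):
          # each candidate plan is a list of (node, cost) pairs; cheapest total wins.
          candidate_plans = {
              "groupby_then_sort": [("DruidGroupByQuery", 10), ("Sort", 10)],  # total 20
              "topn_only": [("DruidTopNQuery", 15)],                           # total 15
          }

          def plan_cost(plan):
              return sum(cost for _node, cost in plan)

          best = min(candidate_plans, key=lambda name: plan_cost(candidate_plans[name]))
          print(best)  # -> "topn_only": the Druid TopN query node is chosen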
  15. Renderer
      - Renders the Druid JSON query to be sent out
      - If any computation cannot be pushed into the JSON query, it runs locally in Calcite
      The Druid TopN query node renders as:
      { "queryType": "topN", "dataSource": "foodmart", "granularity": "all", …
  16. Major Problems
      - Did not support post-aggregations
        - e.g. the AVERAGE function; could run out of memory
      - Did not support filtered aggregations
        - Could cause Druid to return all rows for Calcite to process in memory
      - Did not support distinct-count aggregators using ThetaSketches
        - Calcite would always try to give the user exact results
        - Distinct-count aggregations were not pushed to Druid
  17. Post-Aggregation Support
      - New rule to merge the post-aggregation node into the Druid query node
      - New renderer that can generate a Druid query with post-aggregations
      (diagram: before the new rule, an extra Aggregate runs in Calcite on top of the Druid GroupBy query node; after it, the aggregate is merged into the Druid GroupBy query node itself)
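      As an illustration, an AVERAGE can now be pushed down as two simple aggregations plus an arithmetic post-aggregation. Below is a hand-written sketch (a Python dict) of the kind of Druid JSON the renderer can emit; the aggregator names are illustrative, and the column and datasource reuse the foodmart example:

          # Sketch: AVG("store_sales") computed inside Druid as SUM / COUNT
          # via a post-aggregation, instead of being computed in Calcite.
          avg_query = {
              "queryType": "groupBy",
              "dataSource": "foodmart",
              "granularity": "all",
              "dimensions": [],
              "aggregations": [
                  {"type": "doubleSum", "name": "sum_sales", "fieldName": "store_sales"},
                  {"type": "count", "name": "row_count"},
              ],
              "postAggregations": [
                  {"type": "arithmetic", "name": "avg_sales", "fn": "/",
                   "fields": [
                       {"type": "fieldAccess", "fieldName": "sum_sales"},
                       {"type": "fieldAccess", "fieldName": "row_count"},
                   ]},
              ],
              "intervals": ["1900-01-09T00:00:00.000/2992-01-10T00:00:00.000"],
          }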
  18. Filtered Aggregations Support
      - New rule to move the Filter operation from Calcite to Druid
      - Optimizations on filters:
        - New rule to extract common filters into the outer filter
        - New rule to combine filters with logical ORs into the outer filter
      (diagram: the Filter2 node sitting above the Aggregate in Calcite is merged into a filtered aggregation inside the Druid query)
  19. Performance
      - Avoids unnecessary row scans in Druid
      - Greatly reduces query runtime when filters are involved
  20. Why ThetaSketch
      - Sketches are a class of streaming, stochastic algorithms
      - Trade off accuracy for speed – orders of magnitude faster
      - Exact up to configurable thresholds and approximate after
      - Mathematically provable error bounds
      - Bounded in space
      - Set operations: union, intersect, difference
      Sketches logo from http://datasketches.github.io
  21. ThetaSketch Support
      - New rule to translate the distinct-count aggregator node into a ThetaSketch node
      - Allow users to configure whether approximate cardinality is allowed
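      For example, when approximate cardinality is allowed, a distinct count can be rendered as a thetaSketch aggregation in the Druid JSON. A hand-written sketch follows (as a Python dict); the dimension and sketch column names are illustrative:

          # Sketch: approximate distinct count via a thetaSketch aggregator
          # over a pre-built sketch column.
          distinct_query = {
              "queryType": "groupBy",
              "dataSource": "foodmart",
              "granularity": "all",
              "dimensions": ["store_state"],
              "aggregations": [
                  {"type": "thetaSketch", "name": "unique_customers",
                   "fieldName": "customer_id_sketch"},
              ],
              "intervals": ["1900-01-09T00:00:00.000/2992-01-10T00:00:00.000"],
          }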
  22. Performance
      - Reduced the running time of queries with distinct-count aggregators when cardinality estimation is allowed
      - Sketch columns can now be utilized
      - With post-aggregation support, more operations can be applied
  23. User Interface
      - Superset is commonly used with Druid
      - Superset's SQL Lab is popular for SQL-speaking databases
      From the Superset documentation: https://superset.incubator.apache.org/
  24. Superset Calcite Connection
      - Superset is a Python application
      - A standard Python DB-API (PEP 249) driver was created
      - SQL Lab can now run ad-hoc queries against Druid
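      A minimal sketch of what querying through such a DB-API driver looks like; the module name calcite_dbapi and the connect() parameters are hypothetical stand-ins for the driver described on this slide:

          from calcite_dbapi import connect  # hypothetical module name

          # Any PEP 249-compliant driver follows this same connect/cursor shape.
          conn = connect(host="calcite-host", port=8080)
          cursor = conn.cursor()
          cursor.execute(
              'SELECT "store_state", SUM("store_sales") '
              'FROM "foodmart" GROUP BY "store_state"'
          )
          for row in cursor.fetchall():
              print(row)
          conn.close()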
  25. (architecture diagram) The user issues a query in Superset's SQL Lab; Superset talks to Calcite over JDBC; Calcite performs query parsing, planning, and any internal computation, pushing work to Druid through its Druid adapter; the output flows back to the user.
  26. Questions

Editor's notes

  • Druid is an open-source data store designed for sub-second queries on real-time and historical data. It is usually used to query event data: it can ingest events in real time and allows flexible data exploration and aggregation. At Oath, many data scientists want to run ad-hoc queries on user event data, and Druid is often their first choice. We will now go over how to work with Druid for ad-hoc queries.
  • In this presentation, we will first go over the reasons we want a SQL interface on top of Druid, briefly introducing Druid and the existing solutions along the way. After that, we will focus on the improvements the SQL interface needed, including our contributions to the open-source project. Finally, we will show how we combined the SQL interface with a clean user interface so that a wider range of users can work with Druid.
  • Let's say I am a data scientist who wants to run this SQL query on Druid. The query asks for the sum of store_sales in California and the sum of store_cost in Oregon, excluding data from October. Since the data lives in Druid, we have to write a query that Druid understands: a JSON document sent in an HTTP request. So what does that JSON look like?
  • After some experience with Druid, one becomes familiar with this format. But for someone who already knows SQL, translating it into this format takes time spent learning the Druid documentation and the JSON syntax. It would be great to have a SQL interface for querying Druid data without losing Druid's performance. Let's explore some SQL interfaces that Druid can work with.
  • The first option is the SQL service provided by Druid itself. It works much like a native Druid query: we send an HTTP request with the SQL statement wrapped in a JSON object. This feature is backed by an open-source tool called Apache Calcite.
  • Clearly, Calcite is the core of any SQL interface on Druid, so to develop one it is necessary to learn about Calcite first. We will briefly introduce Calcite and how it works, then go over the contributions we made to open-source Calcite.
  • The most important concept in Calcite is relational algebra. Calcite translates a SQL statement into a tree whose nodes represent the query's logic. The example tree corresponds to the SQL statement on the slide. First, the TableScan pins the query to a particular table. The Filter holds the WHERE logic (the value of column b equals 1). The Project node is less obvious: in many SQL databases the actual column name may include a namespace, and the Project node handles that kind of translation. The Aggregate node holds the SUM function, and the Sort node corresponds to the ORDER BY clause.
  • Given the relational logic tree, we can transform it into another tree representing equivalent logic, and the transformed tree is then translated into a Druid query. Transformation rules let subtrees with equivalent logic be exchanged for one another. In this example, the first four nodes can be transformed into a Druid GroupBy query node through the rules we specified; along another transformation path, the whole tree can be transformed into a Druid TopN query node with equivalent logic. The question now is which tree to pick as the final result.
  • The final result is determined by Calcite's optimizer. Each node in Calcite has a cost quantifying the computation required to run its logic, so different output trees have different costs, and the tree with the minimum cost wins. Back to our example: the tree on top costs 10 + 10 = 20, while the other output tree has a single node costing 15. The optimizer therefore picks the Druid TopN query node as the final result.
  • Now the final job is to render the Druid JSON query from the logic in the Druid query node. The renderer is essentially a JSON writer that generates the query from the information held in the node.
  • The whole idea of Calcite is brilliant, but when we tried to use it, it still had missing pieces that hurt performance, so we decided to contribute enhancements back to the open-source repository. These are the two major problems we worked on. First, Calcite at the time did not support post-aggregations, so part of the computation for functions like AVERAGE was performed in Calcite instead of Druid; moving that computation to Druid saves memory and time when running queries. Second, filtered aggregations were not supported, so Calcite might ask Druid for all rows and apply the filter on the local machine, causing a huge performance fallback whenever filtered aggregations were involved.
  • To add post-aggregation support, we added new rules that merge the post-aggregation node into supported Druid query nodes, so the final tree includes the post-aggregation logic. A new renderer was also needed to render queries with post-aggregators. Now, whenever possible, post-aggregations are pushed into the Druid query node.
  • Much like the post-aggregation support, we added new rules to deal with the filter node sitting right above the aggregate node. The general idea is to move the inner filter into the outer filter.
