Druid and Hive Together: Use Cases and Best Practices

DataWorks Summit
21 Mar 2019

  1. © Cloudera, Inc. All rights reserved. DRUID AND HIVE TOGETHER USE CASES AND BEST PRACTICES Nishant Bangarwa
  2. AGENDA: Motivation • Introduction to Druid • Hive and Druid • Performance Numbers • Demo
  3. Database popularity trend in the last 24 months
  4. Challenges with specialized DBs • Each specialized DB has a different dialect and API • Diverse security and audit mechanisms • Different governance models • Data from different sources needs to be combined on the client side • Need a solution that provides performance without added complexity
  5. Query Federation with Apache Hive: extensible Storage Handler • Input Format • Output Format • SerDe • Rules for pushing computations (filters, aggregates, sort, limit, etc.) • Transforms from SQL to specialized dialects
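To make the federation idea concrete, here is a minimal HiveQL sketch: one SQL dialect querying a Druid-backed table joined with an ordinary Hive table, so data from the two sources is combined inside Hive rather than on the client. The table and column names (druid_metrics, dim_products) are illustrative, not from the talk.

  -- druid_metrics is assumed to be backed by Druid via the storage handler;
  -- dim_products is a plain Hive table.
  SELECT p.category, SUM(m.revenue) AS total_revenue
  FROM druid_metrics m
  JOIN dim_products p ON m.product_id = p.product_id
  WHERE m.`__time` >= '2019-01-01 00:00:00'
  GROUP BY p.category;

Hive can push the time filter down to Druid and execute the join itself.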
  6. Introduction to Apache Druid: a high-performance analytics data store for timeseries data
  7. Companies Using Druid http://druid.io/druid-powered
  8. When to use Druid? • Event data / timeseries data • Realtime: need to analyze events as they happen; delays can lead to business loss, e.g. fraud detection • High data ingestion rate • Scales horizontally • Queries generally involve aggregations and filtering on time • Results for the last quarter • Aggregate comparisons over time, e.g. this week compared to last week • Result set is much smaller than the actual dataset being queried
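As a concrete illustration of the query shape Druid serves well, a small aggregate computed over a time-filtered slice of a large event table. This is a hedged sketch; druid_events and its columns are hypothetical.

  -- Weekly event counts for the last two weeks: a tiny result set
  -- carved out of a large, time-partitioned dataset.
  SELECT floor(`__time` TO WEEK) AS wk, count(*) AS events
  FROM druid_events
  WHERE `__time` >= '2019-03-04 00:00:00'
  GROUP BY floor(`__time` TO WEEK);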
  9. Common Use Cases • User activity and behavior analysis: clickstreams, viewstreams, and activity streams; measuring user engagement, tracking A/B test data for product releases, and understanding usage patterns • Application performance management: operational data generated by applications; identify bottlenecks and troubleshoot issues in real time • IoT and device metrics: ingest machine-generated data in real time; optimize hardware resources, identify issues, detect anomalies • Digital marketing: understand advertising campaign performance, click-through rates, conversion rates
  10. When NOT to use Druid? • Updating existing records using a primary key: updates must be done by rebuilding segments, i.e. re-ingestion (see the sketch below) • Queries that dump the entire dataset • Joining one big fact table to another big fact table • Query latency is not important for the business use case • Offline reporting systems
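Because segments are rebuilt rather than updated in place, a correction from Hive typically means re-ingesting the affected data. A hedged sketch, assuming the table was created through the Druid storage handler and that the Hive version in use supports INSERT OVERWRITE on Druid tables; src_corrected is a hypothetical corrected source table.

  -- Rebuild the Druid segments from corrected source data (re-ingestion).
  INSERT OVERWRITE TABLE druid_table
  SELECT `__time`, page, `user`, c_added, c_removed
  FROM src_corrected;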
  11. Key Druid Features • Column-oriented storage • Sub-second query times • Arbitrary slicing and dicing of data • Native search indexes • Horizontally scalable • Streaming and batch ingestion • Automatic data summarization • Time-based partitioning • Flexible schemas • Rolling upgrades
  12. Druid Concepts: Time-Based Partitioning 1. Time-partitioned segment files 2. Segments are versioned to support batch overrides 3. Per-segment query results are cached [Diagram: a week of data on a timeline, one segment per day: Segment 1 (Monday, version 1), Segment 2 (Tuesday, version 1), Segment 3 (Wednesday, version 2 after an override), Segment 4 (Thursday, version 1), and Friday split into Segments 5_1 and 5_2 (both version 1)]
  13. Druid Architecture [Diagram: streaming data is ingested by realtime index tasks, which hand segments off to historical nodes; batch data is loaded directly onto historical nodes; broker nodes fan queries out across realtime and historical nodes]
  14. Apache Hive and Apache Druid • Hive: large-scale queries; joins, subqueries; windowing functions; transformations; complex aggregations; advanced sorting; UDFs • Druid: queries to power visualizations; needles-in-a-haystack; dimensional aggregates; TopN queries; timeseries queries; min/max values; streaming ingestion
  15. Integration Benefits 1. Streaming Ingestion 2. Single SQL dialect and API 3. Central security controls and audit trail 4. Unified governance 5. Ability to combine data from multiple sources 6. Data independence
  16. Druid data sources in Hive: registering existing Druid data sources with a simple CREATE EXTERNAL TABLE statement:

  CREATE EXTERNAL TABLE druid_table_1
  STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
  TBLPROPERTIES ("druid.datasource" = "wikiticker");

  (druid_table_1: Hive table name; DruidStorageHandler: Hive storage handler class; wikiticker: Druid data source name)
  ⇢ Broker node endpoint is specified as a Hive configuration parameter (see below)
  ⇢ Automatic Druid data schema discovery via a segment metadata query
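A minimal sketch of pointing Hive at the broker; the property hive.druid.broker.address.default is part of Hive's Druid integration, and the host shown is illustrative.

  -- Tell Hive which Druid broker to query for registered tables.
  SET hive.druid.broker.address.default=druid-broker.example.com:8082;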
  17. Druid data sources in Hive: creating Druid data sources with a Create Table As Select (CTAS) statement:

  CREATE EXTERNAL TABLE druid_table
  STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
  TBLPROPERTIES ("druid.segment.granularity" = "DAY")
  AS SELECT `time`, page, `user`, c_added, c_removed FROM src;

  (druid_table: Hive table name; DruidStorageHandler: Hive storage handler class; druid.segment.granularity: Druid segment granularity)
  ⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on the Hive column type
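Because the Druid schema is inferred from Hive column types, it can help to cast explicitly in the CTAS. A hedged sketch using the column names from the slide: the timestamp column becomes Druid's __time, string columns become dimensions, and numeric columns become metrics.

  CREATE EXTERNAL TABLE druid_table_typed
  STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
  TBLPROPERTIES ("druid.segment.granularity" = "DAY")
  AS SELECT
    CAST(`time` AS timestamp) AS `__time`,  -- becomes the Druid timestamp
    CAST(page AS string) AS page,           -- dimension
    CAST(`user` AS string) AS `user`,       -- dimension
    CAST(c_added AS bigint) AS c_added,     -- metric
    CAST(c_removed AS bigint) AS c_removed  -- metric
  FROM src;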
  18. Druid data sources in Hive: the File Sink operator uses the Druid output format, which creates segment files and registers them in Druid. Data needs to be partitioned by time granularity (the granularity is specified as a configuration parameter), with optional data summarization. Original CTAS physical plan: Table Scan → Select → File Sink. CTAS query results:

  __time               | page   | user  | c_added | c_removed
  2011-01-01T01:05:00Z | Justin | Boxer | 1800    | 25
  2011-01-02T19:00:00Z | Justin | Reach | 2912    | 42
  2011-01-01T11:00:00Z | Ke$ha  | Xeno  | 1953    | 17
  2011-01-02T13:00:00Z | Ke$ha  | Helz  | 3194    | 170
  2011-01-02T18:00:00Z | Miley  | Ashu  | 2232    | 34
  19. Druid data sources in Hive: rewritten CTAS physical plan: Table Scan → Select → Reduce → File Sink. The rewrite adds a __time_granularity column that truncates each timestamp to the segment granularity (here DAY), so that rows can be partitioned into segments. CTAS query results:

  __time               | page   | user  | c_added | c_removed | __time_granularity
  2011-01-01T01:05:00Z | Justin | Boxer | 1800    | 25        | 2011-01-01T00:00:00Z
  2011-01-02T19:00:00Z | Justin | Reach | 2912    | 42        | 2011-01-02T00:00:00Z
  2011-01-01T11:00:00Z | Ke$ha  | Xeno  | 1953    | 17        | 2011-01-01T00:00:00Z
  2011-01-02T13:00:00Z | Ke$ha  | Helz  | 3194    | 170       | 2011-01-02T00:00:00Z
  2011-01-02T18:00:00Z | Miley  | Ashu  | 2232    | 34        | 2011-01-02T00:00:00Z
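The optional data summarization (rollup) pre-aggregates rows that share the same truncated timestamp and dimension values at ingestion time. A hedged sketch; the table property druid.query.granularity is assumed here to control the rollup granularity, separately from the segment granularity.

  CREATE EXTERNAL TABLE druid_table_rollup
  STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
  TBLPROPERTIES (
    "druid.segment.granularity" = "DAY",  -- one segment file per day
    "druid.query.granularity" = "HOUR")   -- sum metrics per page/user/hour
  AS SELECT `__time`, page, `user`, c_added, c_removed FROM src;

With HOUR granularity, multiple events for the same page and user within an hour collapse into a single row with summed metrics.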
  20. Druid data sources in Hive: creating streaming Druid data sources with a CREATE EXTERNAL TABLE statement:

  CREATE EXTERNAL TABLE druid_streaming (
    `__time` timestamp,
    `dimension1` string,
    `metric1` int,
    `metric2` double,
    ...)
  STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
  TBLPROPERTIES (
    "druid.segment.granularity" = "DAY",
    "kafka.bootstrap.servers" = "localhost:9092",
    "kafka.topic" = "topic1");

  (druid_streaming: Hive table name; druid.segment.granularity: Druid segment granularity; the kafka.* properties configure the Kafka source)
  21. Druid data sources in Hive: managing streaming ingestion from Hive with ALTER TABLE statements:

  ALTER TABLE druid_streaming SET TBLPROPERTIES('druid.kafka.ingestion' = 'START');
  ALTER TABLE druid_streaming SET TBLPROPERTIES('druid.kafka.ingestion' = 'STOP');
  ALTER TABLE druid_streaming SET TBLPROPERTIES('druid.kafka.ingestion' = 'RESET');

  ⇢ RESET resets the Kafka offsets maintained by Druid for ingestion
  22. Querying Druid data sources • Automatic rewriting when a query is expressed over a Druid table – powered by Apache Calcite – main challenge: identifying patterns in the logical plan that correspond to different kinds of Druid queries (Timeseries, TopN, GroupBy, Select) • The (sub)plan of operators is translated into a valid Druid JSON query – the Druid query is encapsulated within the Hive TableScan operator • The Hive TableScan uses the Druid input format – it submits the query to Druid and generates records out of the query results • It might not be possible to push all computation to Druid – the contract is that the query should always be executed
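One way to see how much of a statement was pushed down is Hive's EXPLAIN. When rewriting succeeds, the plan collapses to a TableScan that carries the generated Druid query in its properties (the exact property name, e.g. druid.query.json, is an assumption here and may vary by version).

  EXPLAIN
  SELECT `user`, sum(`c_added`) AS s
  FROM druid_table_1
  GROUP BY `user`
  ORDER BY s DESC
  LIMIT 10;
  -- Expect a single TableScan over druid_table_1 with the embedded Druid
  -- groupBy JSON, rather than separate Hive aggregate/sort operators.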
  23. Querying Druid data sources. Apache Hive SQL query (top 10 users that added the most characters from the beginning of 2010 until the end of 2011):

  SELECT `user`, sum(`c_added`) AS s
  FROM druid_table_1
  WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011
  GROUP BY `user`
  ORDER BY s DESC
  LIMIT 10;

  Query logical plan: Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink
  24. Querying Druid data sources (same query). Initial plan: only the scan is executed in Druid (as a Druid select query); the rest of the plan (filter, project, aggregate, sort/limit) is executed in Hive.
  25. Rewriting rules push computation into Druid, after checking that each operator meets certain pre-conditions before it is pushed.
  26. First the filter and projection are folded into the Druid query, which is still a select query.
  28. Then the aggregate, sort, and limit are pushed as well, and the Druid query becomes a groupBy.
  30. Querying Druid data sources: the rewritten physical plan is just Druid Scan → File Sink, with the whole query translated into Druid JSON:

  {
    "queryType": "groupBy",
    "dataSource": "users_index",
    "granularity": "all",
    "dimensions": [ "user" ],
    "aggregations": [
      { "type": "longSum", "name": "s", "fieldName": "c_added" }
    ],
    "limitSpec": {
      "limit": 10,
      "columns": [ { "dimension": "s", "direction": "descending" } ]
    },
    "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
  }
  31. Druid input format • Submits the query to Druid and generates records out of the query results • Current version: – Timeseries, TopN, and GroupBy queries are not partitioned; they are sent directly to the Druid broker – Select/Scan queries are partitioned, and realtime and historical nodes are contacted directly [Diagram: a single Table Scan / record reader for Timeseries, TopN, and GroupBy vs. one Table Scan / record reader per node for Select]
  32. Performance and Scalability: Fast Facts • Most events per day: 300 billion events/day (Metamarkets) • Most computed metrics: 1 billion metrics/min (Jolata) • Largest cluster: 200 nodes (Snap Inc.) • Largest hourly ingestion: 2 TB per hour (Netflix)
  33. Performance Numbers • Query latency: average 500 ms; 90th percentile < 1 sec; 95th percentile < 5 sec; 99th percentile < 10 sec • Query volume: 1000s of queries per minute • Benchmarking code: https://github.com/druid-io/druid-benchmark
  34. Performance Numbers [Chart: Star Schema Benchmark (SSB) results at 1 TB scale]
  35. Useful Resources • Druid website: http://druid.io • Druid user group: users@druid.incubator.apache.org • Druid dev group: dev@druid.incubator.apache.org • Hive-Druid integration: https://cwiki.apache.org/confluence/display/Hive/Druid+Integration • Blogs: https://hortonworks.com/blog/apache-hive-druid-part-1-3/ • Query Federation with Apache Hive: https://hortonworks.com/blog/query-federation-with-hive/
  36. THANK YOU