HAWQ Meets Hive - Querying Unmanaged Data

  1. © 2017 Pivotal Software, Inc. All rights reserved. Querying Unmanaged Data: HAWQ meets Hive. Shivram Mani, Oleksandr Diachenko
  2. Agenda ● Overview of Apache HAWQ (incubating) ● HAWQ Architecture ● HAWQ Extension Framework ● HAWQ Hive Integration ● HAWQ HCatalog Integration
  3. Apache HAWQ’s Lineage (timeline, 1986 to 2015): Postgres developed at UC Berkeley; Postgres adds support for SQL; Open Source PostgreSQL; PostgreSQL 7.0 released; PostgreSQL 8.0 released; Greenplum based on PostgreSQL; Hadoop 1.0 released; HAWQ project launched; Hadoop 2.0 released; HAWQ goes open-source (Apache)
  4. HAWQ Overview Multi-level Fault Tolerance Granular Authorization Resource Mgmt (+ YARN) Multi-tenancy + Security ANSI SQL Standard OLAP Extensions JDBC ODBC Connectivity Online Expansion Hadoop / HDFS Operations Cost Based Optimizer (ORCA) Dynamic Pipelining ACID + Transactional MPP Architecture Data Federation Language Extensions Advanced Analytics MPP Database for Enterprises Extensibility HDFS Native File Formats Compression + Partitioning Core Connectivity - Enable Data Science - Large Scale Analytics - Query All Data Types & sources - Manage Multiple Workloads - Security controls - Well Integrated - Leverage Existing SQL Skills & BI Tools - High-performance Ambari Management Machine Learning
  5. HAWQ Components HAWQ Master (1) Metadata Transaction Mgr. Query Parser Query Optimizer Resource Mgr. NN cache Query Dispatch Fault Tolerant Svc HAWQ Segment (1..N) Postmaster Local directory (Temp Data / Logs) Virtual Segments (Query Executors) libhdfs3 Datanode YARN NM HAWQ Standby Master (1)
  6. Query Execution (Native). Servers: Server 1, Server 2, Server N. HAWQ Master: Metadata, Transaction Mgr., Query Parser, Query Optimizer, Resource Mgr., NN Cache, Postmaster. Also shown: NameNode, YARN RM, Interconnect. Each server: HAWQ Segment, Postmaster, HDFS Datanode, Local directory.
  7. Query Execution - Plan. Servers: Server 1, Server 2, Server N. HAWQ Master: Metadata, Transaction Mgr., Query Parser, Query Optimizer, NN Cache, Resource Mgr., Postmaster, Query Dispatch. Also shown: NameNode, YARN RM. Each server: HAWQ Segment, Postmaster, HDFS Datanode, Local directory.
  8. Query Execution - Resource. Servers: Server 1, Server 2, Server N. HAWQ Master: Metadata, Transaction Mgr., Query Parser, Query Optimizer, NN Cache, Resource Mgr., Postmaster, Query Dispatch. Also shown: NameNode, YARN RM. Each server: HAWQ Segment, Postmaster, HDFS Datanode, Local directory. Request: "I need 5 containers, each with 1 CPU core and 1 GB RAM". Allocation: Server 1: 2 containers; Server 2: 1 container; Server N: 2 containers. VS = Virtual Segment (container for Query Executors). # of QEs in a v-seg = # of slices in a query.
  9. Query Execution - Prepare. Servers: Server 1, Server 2, Server N. HAWQ Master: Metadata, Transaction Mgr., Query Parser, Query Optimizer, NN Cache, Resource Mgr., Postmaster, Query Dispatch. Also shown: NameNode, YARN RM. Each server: HAWQ Segment, Postmaster, HDFS Datanode, Local directory. Five virtual segments (VS) across the servers. VS = Virtual Segment (container for Query Executors). # of QEs in a v-seg = # of slices in a query.
  10. Query Execution - Execute. Servers: Server 1, Server 2, Server N. HAWQ Master: Metadata, Transaction Mgr., Query Parser, Query Optimizer, NN Cache, Resource Mgr., Postmaster, Query Dispatch. Also shown: NameNode, YARN RM. Each server: HAWQ Segment, Postmaster, HDFS Datanode, Local directory. Five virtual segments (VS) across the servers. VS = Virtual Segment (container for Query Executors). # of QEs in a v-seg = # of slices in a query.
  11. Query Execution - Result. Servers: Server 1, Server 2, Server N. HAWQ Master: Metadata, Transaction Mgr., Query Parser, Query Optimizer, NN Cache, Resource Mgr., Postmaster, Query Dispatch. Also shown: NameNode, YARN RM. Each server: HAWQ Segment, Postmaster, HDFS Datanode, Local directory. Five virtual segments (VS) across the servers. VS = Virtual Segment (container for Query Executors). # of QEs in a v-seg = # of slices in a query.
  12. Reasons why HAWQ is high-performance: Highly efficient MPP (massively parallel processing) heritage and architecture; Dynamic pipelining, no intermediate writes to disk; Advanced cost-based optimizer; Scalable and fast Interconnect; Native (C++) HDFS access/scan speed; HDFS metadata cache; Optimal data locality matching methods
  13. TPC-DS benchmark: TPC-DS Queries with 5-Users (chart; axes: query number 1 to 99, seconds). • HAWQ ~1.3x faster • Competing MPP Hadoop engine failed to complete 47% of the queries (unmodified). Legend for test queries that failed in the other engine: Unsupported SQL, Long running killed, Memory Limit Exceeded. * Queries that did not complete are omitted from results on both platforms
  14. Managed vs Unmanaged data. Managed data: Metadata. Unmanaged data: Metadata ???
  15. HAWQ eXtension Framework (aka PXF): Uniform tabular view to heterogeneous data sources; Exploits parallelism for data access; Pluggable framework for custom connectors (profiles); Built-in connectors for various data sources/formats
  16. PXF Architecture ➔ Independent JVM ➔ Runs alongside namenode and datanodes. Diagram labels: PXF; Tomcat (Webapp); REST API; Java API; External Tables; Java API; Java/Thrift ● JDBC ● Solr ● Redis ● Cassandra ● GemfireXD
  17. Query Execution (External Data). Servers: Server 1, Server 2, Server N. Also shown: HAWQ Master, NameNode, Postmaster. Each server: HAWQ Segment, Postmaster, HDFS Datanode, Local directory.
  18. Query Planning - Distribution. Servers: Server 1, Server 2, Server N. HAWQ Master: Planner, Partition Mapper, Postmaster. Also shown: NameNode, PXF. Each server: HAWQ Segment, Postmaster, HDFS Datanode, Local directory. Get Partition Metadata: {P1, P2, P3, P4, P5}. The Partition Mapper splits the partitions across the segments: {P1, P4}, {P5}, {P2, P3}.
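
The partition-mapping step on the slide above can be sketched roughly as follows. This is a minimal illustration and not HAWQ's planner code: the function `map_partitions`, the host names, and the block locations are all hypothetical, and the policy (prefer a host that stores the partition, break ties by current load) is an assumption chosen because it reproduces the split shown on the slide.

```python
from collections import defaultdict


def map_partitions(partition_hosts, segment_hosts):
    """Assign each partition to one segment host, preferring data-local hosts.

    partition_hosts: dict of partition name -> list of hosts storing it
                     (hypothetical stand-in for the partition metadata).
    segment_hosts:   list of hosts that run a segment.
    """
    load = {host: 0 for host in segment_hosts}
    assignment = defaultdict(list)
    for partition, replicas in partition_hosts.items():
        # Prefer hosts that hold a local copy; otherwise consider every host.
        candidates = [h for h in replicas if h in load] or list(segment_hosts)
        # Pick the least-loaded candidate (ties go to the first one listed).
        target = min(candidates, key=lambda h: load[h])
        assignment[target].append(partition)
        load[target] += 1
    return dict(assignment)


# Hypothetical block locations for the five partitions named on the slide.
partition_hosts = {
    "P1": ["server1"],
    "P2": ["serverN"],
    "P3": ["serverN"],
    "P4": ["server1"],
    "P5": ["server2"],
}
mapping = map_partitions(partition_hosts, ["server1", "server2", "serverN"])
print(mapping)
```

With these assumed locations, each segment host ends up reading only partitions it stores locally, which is the point of the distribution step.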
  19. Query Execution - Read. HAWQ Master, Postmaster, NameNode. Each server: HAWQ Segment, Postmaster, HDFS Datanode, PXF. Five virtual segments (VS) read partitions P1, P2, P3, P4 and P5 through PXF.
  20. Query Execution - Result. HAWQ Master, Postmaster, NameNode, PXF. Servers: Server 1, Server 2, Server N. Each server: HAWQ Segment, Postmaster, HDFS Datanode, Local directory. Five virtual segments (VS) across the servers. Global Aggregate. VS = Virtual Segment (container for Query Executors). # of QEs in a v-seg = # of slices in a query.
  21. HAWQ-Hive Data Integration. HiveRC ➢ Works for RCFile format. Hive ➢ Works for heterogeneous tables ➢ Supports all formats ➢ Unoptimized. HiveText ➢ Works fast for text data ➢ Lazy data resolution ➢ Only text datatypes are supported. HiveORC ➢ Optimized for ORC data ➢ Leverages predicate pushdown ➢ Column projection. HiveVectorizedORC ➢ Uses ORC Batch API ➢ Sends 1024-row batches to HAWQ ➢ Enables Vectorized Execution
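
The idea of sending rows in batches rather than one tuple at a time can be sketched as below. This is only an illustration of batching in general: the helper name `batches` is hypothetical, and it does not use or reproduce the ORC Batch API; only the batch size of 1024 comes from the slide.

```python
from itertools import islice

BATCH_SIZE = 1024  # batch size stated on the slide


def batches(rows, size=BATCH_SIZE):
    """Yield lists of up to `size` rows from any iterable of rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return  # source exhausted
        yield batch


# 2500 rows are delivered as two full batches and one partial batch.
sizes = [len(batch) for batch in batches(range(2500))]
print(sizes)
```

Handing over a whole batch per call amortizes the per-call overhead across many rows, which is what makes a vectorized execution path possible downstream.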
  22. HAWQ-Hive ORC Optimizations. Example queries: SELECT col1, col2 FROM tab1 WHERE col3 = 'abc'; SELECT COUNT(*) FROM tab1 WHERE col3 = 'abc'; Flow: HAWQ Master, Query Dispatch, HAWQ Segment, Postmaster, PXF, ORC API. Passed to PXF: column attributes: col1, col2; predicate: RPNF {filter(s)}; aggregate functions. PXF passes {col1, col2; col3 = 'abc'} to the ORC API. Table columns: col1; col2; col3; col4
  23. Optimizations. Statistics ● Exposing statistics about unmanaged tables ● Optimized query plan. Column projection ● Passing requested columns ● Disk I/O is optimized if data format allows. Predicate pushdown ● Passing down predicates from WHERE clause through the PXF framework ● Partition/stripe/file elimination. Batches vs tuples ● HiveText ● HiveVectorizedORC ● Lazy data resolution
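
Predicate pushdown with stripe elimination, combined with column projection, can be sketched as follows. This is a toy in-memory model and not the PXF or ORC reader code: the `Stripe` class, `make_stripe`, and `scan` are hypothetical names, and the only fact borrowed from ORC is that it keeps per-stripe minimum/maximum column statistics that let a reader skip stripes.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    rows: list      # list of dict rows (toy stand-in for an ORC stripe)
    minimum: dict   # column -> smallest value in this stripe
    maximum: dict   # column -> largest value in this stripe


def make_stripe(rows):
    """Build a stripe and precompute its per-column min/max statistics."""
    columns = list(rows[0].keys())
    return Stripe(
        rows=rows,
        minimum={c: min(r[c] for r in rows) for c in columns},
        maximum={c: max(r[c] for r in rows) for c in columns},
    )


def scan(stripes, columns, filter_column, value):
    """Evaluate `filter_column = value`, returning only `columns`.

    Stripes whose [min, max] range cannot contain `value` are skipped
    without touching their rows (stripe elimination).
    """
    result, skipped = [], 0
    for stripe in stripes:
        if value < stripe.minimum[filter_column] or value > stripe.maximum[filter_column]:
            skipped += 1
            continue
        for row in stripe.rows:
            if row[filter_column] == value:
                # Column projection: keep only the requested columns.
                result.append({c: row[c] for c in columns})
    return result, skipped


stripes = [
    make_stripe([{"col1": 1, "col2": "a", "col3": "abc", "col4": 0.1},
                 {"col1": 2, "col2": "b", "col3": "abd", "col4": 0.2}]),
    make_stripe([{"col1": 3, "col2": "c", "col3": "xyz", "col4": 0.3},
                 {"col1": 4, "col2": "d", "col3": "zzz", "col4": 0.4}]),
]
# Mirrors: SELECT col1, col2 FROM tab1 WHERE col3 = 'abc';
rows, skipped = scan(stripes, ["col1", "col2"], "col3", "abc")
print(rows, skipped)
```

In this toy data the second stripe is eliminated from its statistics alone, and only `col1` and `col2` are materialized for the matching row.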
  24. HAWQ-Hive Catalog Integration. Was: CREATE EXTERNAL TABLE items (column1 int, column2 string) LOCATION ('pxf://namenode:51200/customer_db?PROFILE=Hive') FORMAT 'custom' (formatter='pxfwritable_import'); SELECT * FROM items; ● Need to create external HAWQ table ● Users need to know HAWQ-Hive data mapping ● Need to keep both tables' metadata in sync manually. Wanted: SELECT * FROM items; ● No need to create external HAWQ table ● Users don't need to know about HAWQ-Hive data type mapping, etc. ● Metadata is always up to date
  25. Challenges with Catalog Unification: Hive Catalog
  26. Challenges with Catalog Unification: HAWQ Catalog
  27. Where to store HCatalog data in HAWQ: Requires few HAWQ changes; Getting all catalog utilities for free; Catalog is polluted with external data; HCatalog objects are visible to concurrent sessions; Session-level isolation; Cheap cleanup process; HAWQ Catalog service needs to be changed to be able to work with disk/memory; Catalog utilities need to be modified to work with HCatalog objects
  28. Object namespaces. Axis marks: 0, 2^32, 10*2^20. HAWQ objects: Global counter (persistent). HCatalog objects: Session 1 counter, Session 2 counter, ..., Session N counter (in-memory). Session states are isolated
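
The split between a persistent global counter and isolated per-session counters can be sketched as below. This is a guess at the scheme and not HAWQ source code: it assumes that the slide's axis marks mean the top 10*2^20 identifiers below 2^32 are reserved for in-memory HCatalog objects, and the names `GlobalCounter`, `Session`, and `FIRST_HCATALOG_OID` are hypothetical.

```python
OID_LIMIT = 2**32
HCATALOG_RANGE = 10 * 2**20                      # assumed size of the reserved range
FIRST_HCATALOG_OID = OID_LIMIT - HCATALOG_RANGE  # assumed boundary, see lead-in


class GlobalCounter:
    """Counter for regular HAWQ objects, shared by all sessions."""

    def __init__(self):
        self.next_oid = 0

    def allocate(self):
        if self.next_oid >= FIRST_HCATALOG_OID:
            raise OverflowError("HAWQ object range exhausted")
        oid = self.next_oid
        self.next_oid += 1
        return oid


class Session:
    """Each session keeps its own in-memory counter for HCatalog objects."""

    def __init__(self):
        self.next_oid = FIRST_HCATALOG_OID

    def allocate_hcatalog_oid(self):
        if self.next_oid >= OID_LIMIT:
            raise OverflowError("HCatalog object range exhausted")
        oid = self.next_oid
        self.next_oid += 1
        return oid


global_counter = GlobalCounter()
session_1, session_2 = Session(), Session()
hawq_oids = [global_counter.allocate(), global_counter.allocate()]
first_1 = session_1.allocate_hcatalog_oid()
second_1 = session_1.allocate_hcatalog_oid()
first_2 = session_2.allocate_hcatalog_oid()
print(hawq_oids, first_1, second_1, first_2)
```

Under these assumptions two sessions can hand out the same HCatalog identifier without conflict, because those identifiers never leave the session's memory and never collide with the persistent range.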
  29. HAWQ-HCatalog Integration. Hive table Weblogs: id double, ts timestamp, ... Queries: SELECT * FROM hcatalog.default.weblogs WHERE ts between '2015-09-01' and '2015-09-30'; SELECT COUNT(*) FROM hcatalog.default.weblogs WHERE ts between '2015-09-01' and '2015-09-30'; Diagram labels: HIVE, HCAT, PXF, HAWQ Catalog service, In Memory Catalog (Weblogs: id double, ts timestamp, ...), Disk Heap Catalog, HAWQ
  30. Apache Hive & HAWQ (via HDB): The Most Comprehensive SQL on Hadoop. Avoid data duplication: all processing engines point to the same copy of data. Right Tool for the Job: choose the right SQL engine based on your application’s needs. ⬢ Apache HAWQ ● MPP engine from the core ● Easy transition from traditional DB/Warehouse ● Ad-hoc Analytics, BI & Visualization ● Low Query Latency ● Scales from 100s of TB to low PBs ● Machine Learning (MADlib) ⬢ Apache Hive ● Holds very detailed information ● Integrates all data sources ● Low-Mid Query Latency ● Scales to 100s of petabytes ● Large Community. Run HAWQ & Hive alongside each other!
  31. Additional Resources: github.com/apache/incubator-hawq; HAWQ Homepage; Getting Started; HAWQ Wiki; PXF Wiki; Sandbox. Categories: Documentation; Wiki/Docs; Code: GitHub (Apache); Join Discussion/Ask Questions: Apache DLs dev@hawq.incubator.apache.org, user@hawq.incubator.apache.org
  32. Additional Slides
  33. © 2016 Pivotal Software, Inc. All rights reserved. HAWQ Resource Manager (diagram). Components: LIBYARN Resource Broker; libyarn; Resource pool; YARN Resource Manager; YARN Node Manager; HAWQ Segment; segments; Global RM container Lifecycle Manager; Indexed Resource Quota Table; HAWQ resource queue manager; Query Quota Calculator; Queue Quota Calculator; Query Resource Request Queuing Facility; HAWQ Query Dispatcher. Resource broker: Register HAWQ as an unmanaged application exclusively consuming a YARN queue; Periodically fetch YARN cluster report, container report and queue report to recognize YARN cluster; Acquire YARN containers with host preference information; Return YARN containers; Unregister HAWQ in YARN. Resource broker uses libyarn (a C/C++ version library) to communicate with YARN through protobuf. Quota flows: Add activated YARN containers’ quota; Return YARN containers’ quota; Accepted YARN container quota; To be returned YARN containers’ quota. Increase HAWQ segment resource quota when new global resource manager’s containers are allocated; Decrease HAWQ segment resource quota when some global resource manager’s containers are decided to be kicked. HAWQ resource queue manager: Acquire calculated resource quota or return unused query resource. HAWQ Query Dispatcher: Acquire/Return query resource; SQL statement; Allocated query resource. Reports: Container report; Cluster report; Queue report. Active YARN containers with resource holding processes started. Drive resource broker to acquire global resource manager containers. The quota of a global resource manager can be (1GB, 1core), (2GB, 1core), etc. Allocate virtual segments with fixed resource quota assigned and dispatch workload to segments. The resource quota can be as small as 128MB, 256MB and as large as GBs.
  34. Resource Management: HAWQ Resource Manager • Responsibility – Responsible for acquiring & returning CPU/Mem resources from/to YARN – Responsible for resource allocation among HAWQ users and queries • Master resource manager process – Resource negotiation with YARN and resource allocation – Manage and maintain the resources in resource pool – Handle resource allocation/return RPC requests from QD (query dispatcher) – Fault tolerance service is in the same process • Segment resource manager process – One HAWQ RM on each Segment – Negotiation with Master resource manager (for resource enforcement) – Fault tolerance service: Heartbeat sender
  35. SQL on Hadoop benchmark
  36. PXF Data Flow
  37. PXF Data Model
  38. Putting it all together: Summary of HAWQ user experience (via HDB). Install and Configure: Ambari to deploy and manage HAWQ, just like any other Hadoop service. Manage Resources: YARN-integrated for dynamic resource allocation across hierarchical groups. Write Queries (ORCA): Advanced optimizer and dynamic pipelining for high-performance response. External Data (PXF): Parallelized access to external data sources (read/write). Enable Data Science: In-database machine learning algorithms for predictive analytics. Extend Data Processing: Procedural language extensions for custom application logic.