Standalone metastore-dws-sjc-june-2018

Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Chargement dans…3
×

Consultez-les par la suite

1 sur 29 Publicité

Plus De Contenu Connexe

Diaporamas pour vous (20)

Similaire à Standalone metastore-dws-sjc-june-2018 (20)

Publicité

Plus récents (20)

Publicité


Slide 1: Sharing Metadata Across the Data Lake and Streams
Alan F. Gates, Co-founder, Hortonworks; Member, Apache Hive PMC
June 2018
Slide 2: © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Slide 3: Metadata in SQL
Slide 4: Big Data SQL Engines
There are many big data SQL engines: Hive, Spark, Impala, Presto, …
[Diagram: logos for Hive, Impala, Presto, and Spark]
Slide 5: Pro Pluribus Unum
There are many big data SQL engines: Hive, Spark, Impala, Presto, …
These engines all store their metadata in the Hive Metastore.
[Diagram: the four engines connected to the Hive Metastore]
Slide 6: The Good, …
There are many big data SQL engines: Hive, Spark, Impala, Presto, … These engines all store their metadata in the Hive Metastore.
Good: shared metadata makes sharing data between engines easier.
Slide 7: The Bad, …
There are many big data SQL engines: Hive, Spark, Impala, Presto, … These engines all store their metadata in the Hive Metastore.
Good: shared metadata makes sharing data between engines easier.
Bad: non-Hive systems have to install much of Hive to get the Metastore.
Bad: it is hard for other projects to contribute to the Metastore.
Slide 8: And the First Proposal
There are many big data SQL engines: Hive, Spark, Impala, Presto, … These engines all store their metadata in the Hive Metastore.
Good: shared metadata makes sharing data between engines easier.
Bad: non-Hive systems have to install much of Hive to get the Metastore.
Bad: it is hard for other projects to contribute to the Metastore.
Proposal: separate the Metastore from Hive.
Slide 9: Breaking out the Metastore
• Enables the Metastore to continue to be used by many engines.
• In Hive 3.0 the Metastore was released as a separate module.
• It can be installed and run without the rest of Hive.
  – A few features are missing when Hive is not present, e.g. the compactor.
  – These are planned for the future.
• Backwards compatibility is maintained for Thrift clients.
  – Older clients can talk to the new, separate Metastore.
• A few small changes are needed for server hook implementations.
• There is a proposal to make it a separate Apache project.
  – This would enable better collaboration with non-Hive projects.
  – Still in discussion with the Hive PMC.
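As an illustration of installing and running it without the rest of Hive, a minimal metastore-site.xml for a standalone deployment might look like the sketch below. This is an assumption-laden example, not from the talk: a MySQL backing store is assumed, and every host name, port, and credential is a placeholder.

```xml
<configuration>
  <!-- Thrift endpoint that clients (Hive, Spark, Impala, Presto) connect to -->
  <property>
    <name>metastore.thrift.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
  <!-- Backing RDBMS that holds the metadata (MySQL assumed here) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://db-host/metastore_db</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>metastore_user</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>metastore_password</value>
  </property>
  <!-- Default warehouse location for managed tables -->
  <property>
    <name>metastore.warehouse.dir</name>
    <value>hdfs://namenode:8020/warehouse</value>
  </property>
</configuration>
```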
Slide 10: Enables Shared Metadata in the Cloud
[Diagram: shared data & storage feeding on-demand ephemeral workloads, with elastic resource management and shared metadata, security & governance spanning them]
Slide 11: Is this HCatalog 2.0?
• Didn't we do this before? Wasn't it called HCatalog?
• No, HCatalog is different.
• HCatalog focuses on making the Metastore accessible to MapReduce, Pig, and other applications.
  – It includes metadata access.
  – It also includes data access (serdes, object inspectors, and input/output formats).
• The Metastore stores metadata, including which serdes etc. to use, but does not provide readers and writers.
• HCatalog stays with Hive in this split; it does not go with the Metastore, because it includes the data access.
Slide 12: Schemas in Streams
Slide 13: Example: Hortonworks Schema Registry
• Provides a central repository for messages' metadata.
• Intended for streaming data (e.g. Kafka) or edge data (e.g. NiFi).
• Can be used by any application via a REST interface.
• Schemas are defined in JSON.
• A schema is tied to a Kafka topic or NiFi flow.
• Every schema has a name, e.g. temp_sensor_data.
• Schemas can have one or more versions.
  – Different messages in a topic may carry different versions of the schema.
  – Compatibility between schema versions can be none, backwards, forwards, or both.
• Lifecycle management: schema versions have a state, e.g. INITIATED, ENABLED, ARCHIVED.
• Serdes are stored with the schema so the system knows how to (de)serialize data.
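The compatibility modes above can be sketched as a check over field lists. This is a greatly simplified model of my own, not the registry's actual algorithm (which follows Avro's schema-resolution rules); schemas are plain dicts here and only field presence, type equality, and defaults are considered.

```python
def can_read(reader_fields, writer_fields):
    """True if a reader using reader_fields can decode data written with
    writer_fields: each reader field must appear in the written data with
    the same type, or carry a default (Avro-style resolution, simplified)."""
    written = {f["name"]: f["type"] for f in writer_fields}
    for f in reader_fields:
        if f["name"] in written:
            if written[f["name"]] != f["type"]:
                return False
        elif "default" not in f:
            return False
    return True

def compatible(mode, new_fields, old_fields):
    """Check a new schema version against the previous one under the
    registry's modes: NONE, BACKWARD, FORWARD, or BOTH."""
    backward = can_read(new_fields, old_fields)  # new reader, old data
    forward = can_read(old_fields, new_fields)   # old reader, new data
    return {"NONE": True, "BACKWARD": backward,
            "FORWARD": forward, "BOTH": backward and forward}[mode]

v1 = [{"name": "sensorId", "type": "long"},
      {"name": "temperature", "type": "int"}]
# Adding a field with a default keeps both directions working:
v2 = v1 + [{"name": "readAt", "type": "long", "default": 0}]
# Changing a field's type breaks backward compatibility:
v3 = [{"name": "sensorId", "type": "long"},
      {"name": "temperature", "type": "double"}]
print(compatible("BOTH", v2, v1), compatible("BACKWARD", v3, v1))  # True False
```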
Slide 14: Example Schema Registry Schema

{
  "name": "temp_sensor_data",
  "fields": [
    { "name": "sensorId", "type": "long" },
    { "name": "location", "type": "record", "fields": [
      { "name": "longitude", "type": "double" },
      { "name": "latitude", "type": "double" }
    ]},
    { "name": "temperature", "type": "int" },
    { "name": "readAt", "type": "long" }
  ]
}
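To show the shape of this schema, a short sketch that parses the slide's JSON and flattens it into dotted field paths (the `flatten` helper is my own illustration, not a registry API):

```python
import json

# The slide's temp_sensor_data schema, verbatim.
SCHEMA = json.loads("""
{ "name": "temp_sensor_data",
  "fields": [
    { "name": "sensorId", "type": "long" },
    { "name": "location", "type": "record", "fields": [
        { "name": "longitude", "type": "double" },
        { "name": "latitude",  "type": "double" } ] },
    { "name": "temperature", "type": "int" },
    { "name": "readAt", "type": "long" } ] }
""")

def flatten(fields, prefix=""):
    """Yield (dotted-path, type) pairs, recursing into nested records."""
    for f in fields:
        path = prefix + f["name"]
        if f["type"] == "record":
            yield from flatten(f["fields"], path + ".")
        else:
            yield path, f["type"]

print(dict(flatten(SCHEMA["fields"])))
# {'sensorId': 'long', 'location.longitude': 'double',
#  'location.latitude': 'double', 'temperature': 'int', 'readAt': 'long'}
```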
Slide 15: Contrasting SQL and Registry Schemas

| SQL | Schema Registry |
| Schema tied to a table | Schema tied to a Kafka topic or NiFi flow |
| Schema applies to all records in a partition | Records in a topic may have different versions of the schema, with no given order |
| Schema defined in SQL DDL: CREATE TABLE T (A INT, B VARCHAR(20)); | Schema defined in JSON |
| Primary access is via SQL for users and Thrift for SQL engines | Primary access is via UI for developers and Java/REST for streaming applications |
| Supports standard SQL types and Java types | Supports Java types |
| No concept of schema lifecycle | Schema lifecycle management via schema version state |
Slide 16: Bringing the Strands Together
Slide 17: First Problem
With both the Hive Metastore and the Schema Registry, we are adding yet another component to the system.
• Administrators have another system to install, monitor, update, …
• Developers must maintain two systems whose basic functionality (recording and serving runtime metadata) is the same.
• Other systems that want to integrate with runtime metadata, such as security systems like Ranger and Sentry and governance systems like Atlas, have to integrate with each component separately.
Slide 18: Second Problem
Hardwiring one perspective into a metadata repository makes it harder to share data between applications.
• Sometimes a streaming application will want to read from a table.
  – It would prefer to think of data in the registry model, whether it comes from a Hive table or a Kafka stream.
• Sometimes a query will want to read from a stream.
  – It needs to think of data as being in a table, whether it comes from a Hive table or a Kafka stream.
• To share data today, tools have to be able to read data using both paradigms.
Slide 19: The Second Proposal: Cross the Streams
• Put the Schema Registry on top of the Metastore.
• It will still support both the SQL and streaming perspectives.
• One system means less work for admins, developers, and other tools.
• One system with multiple perspectives means:
  – streaming tools can view data as a stream whether it is in Kafka or Hive;
  – batch tools can view data as a table whether it is in Hive or Kafka.
Slide 20: Streaming Application Reading from a Table
Example:
• A stream userEvents, and an application that flags users who have called support in the last 24 hours.
• Hive has the record of support calls; Kafka does not.
• The app can cache the table every hour, then do a join as events arrive to flag users who need extra attention.
• This is possible today, but requires caching the data in Kafka or coding the app to read both Hive and Kafka.
• Because the HMS and SR are unified, streaming apps can view the Hive table as an SR schema.

Kafka topic userEvents schema:
{ "group": "kafka", "fields": [{ "userid": "long", "eventtype": "string", ... }] }

Hive table support_calls:
userid long, calltime timestamp, summary string

supportCalls schema:
{ "group": "hive", "fields": [{ "userid": "long", "calltime": "timestamp", "summary": "string" }] }
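The cache-then-join pattern on this slide can be sketched as a tiny in-memory simulation. Everything here is illustrative: the table rows, the event stream, and the `needs_attention` helper are stand-ins, not real Hive or Kafka client code.

```python
from datetime import datetime, timedelta

# Hourly cached snapshot of the Hive support_calls table (made-up rows).
support_calls = [
    {"userid": 1, "calltime": datetime(2018, 6, 1, 9, 0)},
    {"userid": 2, "calltime": datetime(2018, 5, 20, 14, 0)},
]

def needs_attention(event, cached_calls, now):
    """Flag the event's user if they called support in the last 24 hours,
    by joining the arriving event against the cached table."""
    cutoff = now - timedelta(hours=24)
    return any(c["userid"] == event["userid"] and c["calltime"] >= cutoff
               for c in cached_calls)

now = datetime(2018, 6, 1, 18, 0)
events = [{"userid": 1, "eventtype": "login"},   # arriving Kafka userEvents
          {"userid": 2, "eventtype": "login"}]
flagged = [e["userid"] for e in events
           if needs_attention(e, support_calls, now)]
print(flagged)  # [1] -- only user 1 called within the last 24 hours
```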
Slide 21: Query Reading from a Stream
Example:
• Hive table user_events, partitioned by event_hour, is loaded every hour from the Kafka topic userEvents.
• Today Hive streaming can quickly ingest data from Kafka, but queries will still miss the last few seconds that have not yet loaded from Kafka.
• We would like to read the latest events from Kafka rather than wait until they load into Hive.
• Because the HMS and SR are unified, Hive can view the Kafka topic as a partition of its table: user_events, partition event_hour='latest'.
• Hive queries can then read the Kafka topic userEvents as a partition of user_events.

Hive table user_events (partitioned by event_hour):
user_id long, event_type varchar(256), event_hour datetime

Kafka topic userEvents schema:
{ "group": "kafka", "fields": [{ "userid": "long", "eventtype": "string", ... }] }
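The idea of treating the live topic as the table's 'latest' partition can be sketched as a scan that unions loaded partitions with the topic's tail. The data structures below are purely illustrative stand-ins for Hive partitions and a Kafka topic.

```python
# Loaded hourly partitions of the Hive table (illustrative rows).
hive_partitions = {
    "2018-06-01T10": [{"user_id": 1, "event_type": "click"}],
    "2018-06-01T11": [{"user_id": 2, "event_type": "view"}],
}
# Events still only in Kafka (the last few seconds, not yet loaded).
kafka_tail = [{"user_id": 3, "event_type": "click"}]

def scan_user_events():
    """Yield every row from the loaded partitions in hour order, then
    the live topic as the 'latest' partition."""
    for hour in sorted(hive_partitions):
        yield from hive_partitions[hour]
    yield from kafka_tail

print([r["user_id"] for r in scan_user_events()])  # [1, 2, 3]
```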
Slide 22: Some Assembly Required
• Need to bridge the gaps between SQL and Registry schemas; this is nontrivial:
  – schema consistent for all records in a partition versus different schema versions in the stream;
  – SQL types versus Java types;
  – schema as an attribute of a table versus as a first-class object with version and lifecycle.
• Will require connectors so streaming apps can use batch serdes and vice versa.
• Work in progress:
  – https://github.com/apache/hive/pull/347
  – https://github.com/apache/hive/pull/348
  – https://github.com/apache/hive/pull/349
  – https://issues.apache.org/jira/browse/HIVE-19521
  – https://issues.apache.org/jira/browse/HIVE-19522
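To make the "SQL types versus Java types" gap concrete, here is a one-way mapping sketch from common Hive column types to the Java-style types a registry schema would use. The mapping table and helper are assumptions of mine, not the actual connector's rules (timestamp-as-long, for instance, is just one plausible encoding).

```python
# Assumed mapping from Hive SQL column types to Java-style registry types.
SQL_TO_JAVA = {
    "int": "int", "bigint": "long", "smallint": "short",
    "double": "double", "float": "float", "boolean": "boolean",
    "string": "string", "varchar": "string",
    "timestamp": "long",  # e.g. epoch millis (an assumption)
}

def sql_columns_to_registry_fields(columns):
    """Convert (name, sql_type) column pairs into registry-style field
    dicts, stripping parameters from types like varchar(256)."""
    fields = []
    for name, sql_type in columns:
        base = sql_type.split("(")[0].strip().lower()
        fields.append({"name": name, "type": SQL_TO_JAVA.get(base, "string")})
    return fields

print(sql_columns_to_registry_fields(
    [("user_id", "bigint"), ("event_type", "varchar(256)")]))
# [{'name': 'user_id', 'type': 'long'}, {'name': 'event_type', 'type': 'string'}]
```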
Slide 23: Can We Share Too Much?
Slide 24: Yes, Yes We Can
Example use case: Hive LLAP is used for analytics, Spark for ETL.
I have been extolling the benefits of a shared Metastore for the last 20 slides, so clearly we want to share one instance between them. But:
• Hive and Spark can't always read each other's data.
  – e.g. Spark can't read Hive's ACID tables.
• Different use cases require different security models.
  – e.g. Spark ETL is likely to use StorageBasedAuth, while LLAP is likely to use Ranger.
• Different defaults are appropriate for different use cases.
  – e.g. doAs=false for LLAP, doAs=true for Hive reads from the Spark catalog.
Slide 25: Third Proposal: Add Catalogs
• A catalog is the standard SQL top-level container.
• Catalogs contain databases, so fully addressing a table becomes catalog.database.table.
• A default catalog 'hive' was added in 3.0, and all existing databases were placed in it.
• In 3.0 catalogs exist only in the Metastore; they are not yet exposed to SQL.
• Goal: different catalogs can have different security settings and defaults.
• Ongoing work; it can be tracked at HIVE-18685.
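The catalog.database.table addressing above can be sketched as a small resolver. The fallback behaviors are assumptions for illustration: a two-part name lands in the 'hive' catalog as the slide describes, while resolving a bare table name to a 'default' database is my own guess, not something the slide states.

```python
DEFAULT_CATALOG = "hive"  # default catalog added in Hive 3.0

def resolve(name):
    """Resolve a table reference into (catalog, database, table),
    filling in the default catalog for shorter names."""
    parts = name.split(".")
    if len(parts) == 3:
        return tuple(parts)                           # fully qualified
    if len(parts) == 2:
        return (DEFAULT_CATALOG, parts[0], parts[1])  # database.table
    return (DEFAULT_CATALOG, "default", parts[0])     # bare table (assumed)

print(resolve("etl.staging.clicks"))  # ('etl', 'staging', 'clicks')
print(resolve("sales.orders"))        # ('hive', 'sales', 'orders')
```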
Slide 26: Example Installation With Catalogs
• LLAP defaults to the 'hive' catalog:
  – tables are ACID by default;
  – Ranger for security;
  – doAs=false.
• Spark defaults to the 'etl' catalog:
  – ACID tables are not allowed;
  – StorageBasedAuth for security;
  – doAs=true.
• Each can still read from the other catalog (assuming permission is granted), but can now be aware of the differing authorization, defaults, etc.
• Also useful in the cloud, where multiple business units may share storage but need different defaults, policies, etc.
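The two-catalog installation on this slide can be written down as a per-catalog policy table. The dictionary keys and the lookup helper are illustrative inventions of mine, not actual Metastore configuration names.

```python
# Per-catalog defaults from the slide, as an illustrative policy table.
CATALOG_POLICY = {
    "hive": {"acid_default": True,  "authorizer": "ranger",
             "doAs": False},   # LLAP analytics
    "etl":  {"acid_default": False, "authorizer": "storage-based",
             "doAs": True},    # Spark ETL
}

def table_defaults(catalog):
    """Look up the defaults a new table inherits from its catalog."""
    return CATALOG_POLICY[catalog]

print(table_defaults("etl")["authorizer"])  # storage-based
```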
Slide 27: What Next?
• Now that the Metastore has been released as a separate module, the Hive community needs to decide whether it becomes a subproject or a separate top-level project.
• Finish the work to integrate the Schema Registry.
• Involve contributors from other, non-Hive projects.
• Finish implementing Catalogs.
• Patches accepted!
Slide 28: Credits
• Apache Atlas, Apache Hadoop, Apache Hive, Apache Impala, Apache Kafka, Apache Pig, Apache Ranger, Apache Sentry, and Apache Spark are Apache Software Foundation projects.
  – All are referred to herein without "Apache" for brevity.
• HDFS and MapReduce are components of Apache Hadoop.
• Thanks to the Hive community for their work in separating the Hive Metastore from much of the rest of Hive.
• Google Translate was used for the Latin slide title.
Slide 29: Thank You
