Sharing metadata across the data lake and streams

Alan F. Gates
Co-founder Hortonworks,
Member Apache Hive PMC
June 2018

Metadata in SQL

Big Data SQL Engines
There are many big data SQL engines:
Hive, Spark, Impala, Presto, …

Pro Pluribus Unum
These engines all store their metadata in the Hive Metastore
[Diagram: Hive, Impala, Presto, and Spark around a shared Hive Metastore]

The Good, …
These engines all store their metadata in the Hive Metastore
Good: Shared metadata makes sharing data between engines easier

The Bad, …
Bad: Non-Hive systems have to install much of Hive to get the Metastore
Bad: Hard for other projects to contribute to the Metastore

And the First Proposal
Bad: Non-Hive systems have to install much of Hive to get the Metastore
Bad: Hard for other projects to contribute to the Metastore
Proposal: Separate the Metastore from Hive

Breaking out the Metastore
• Enables the Metastore to continue to be used by many engines
• In Hive 3.0 the Metastore was released as a separate module
• Can be installed and run without the rest of Hive
– A few features are missing when Hive is not present, e.g. the compactor
– Planning to add these in the future
• Backwards compatibility maintained for Thrift clients
– Older version clients can talk to the new, separate Metastore
• A few small changes for server hook implementations
• There is a proposal to make it a separate Apache project
– Will enable better collaboration with non-Hive projects
– Still in discussion with the Hive PMC on this

Enables Shared Metadata in the Cloud
[Diagram: shared data and storage; on-demand ephemeral workloads; elastic resource management; shared metadata, security, and governance]

Is this HCatalog 2.0?
• Didn’t we do this before? Wasn’t it called HCatalog?
• No, HCatalog is different
• HCatalog focuses on making the Metastore accessible by MapReduce, Pig, and other applications
– Includes metadata access
– Also includes data access (serdes, object inspectors, and input/output formats)
• The Metastore stores metadata, including which serdes etc. to use, but does not provide readers and writers
• HCatalog stays with Hive in this split; it does not go with the Metastore
– Because it includes the data access

Schemas in Streams

Example: Hortonworks Schema Registry
• Provides a central repository for messages’ metadata
• Intended for streaming data (e.g. Kafka) or edge data (e.g. NiFi)
• Can be used by any application via REST interface
• Schema defined in JSON
• Schema is tied to a Kafka topic or NiFi flow
• Every schema has a name: e.g. temp_sensor_data
• Schemas can have one or more versions
– Different messages in a topic may have different versions of the schema
– Compatibility between schema versions can be none, backwards, forwards, or both
• Lifecycle management: schema versions have state, e.g. INITIATED, ENABLED, ARCHIVED
• Serdes stored with schema so the system knows how to (de)serialize data
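
The compatibility levels listed above (none, backwards, forwards, both) can be illustrated with a small sketch. The slides do not state the exact rules the registry applies, so the check below assumes Avro-style rules: a reader can use a schema version if every field it expects is either present in the written data with the same type or has a default. The function names and the `default` key are hypothetical, and nested record fields are compared by type name only.

```python
from typing import Any

Schema = dict[str, Any]


def fields_by_name(schema: Schema) -> dict[str, dict]:
    """Index a schema's top-level fields by name."""
    return {f["name"]: f for f in schema["fields"]}


def can_read(reader: Schema, writer: Schema) -> bool:
    """True if data written with `writer` can be read using `reader`
    (assumed rule: same type, or a default for fields the writer lacks)."""
    written = fields_by_name(writer)
    for name, field in fields_by_name(reader).items():
        if name in written:
            if written[name]["type"] != field["type"]:
                return False
        elif "default" not in field:
            return False
    return True


def compatibility(old: Schema, new: Schema) -> str:
    """Classify a new schema version relative to the old one."""
    backward = can_read(new, old)  # new readers, old data
    forward = can_read(old, new)   # old readers, new data
    if backward and forward:
        return "both"
    if backward:
        return "backwards"
    if forward:
        return "forwards"
    return "none"


v1 = {"name": "temp_sensor_data", "fields": [
    {"name": "sensorId", "type": "long"},
    {"name": "temperature", "type": "int"}]}
# Adds a field with a default: old and new readers both keep working.
v2 = {"name": "temp_sensor_data", "fields": v1["fields"] + [
    {"name": "humidity", "type": "int", "default": 0}]}
# Adds a required field: new readers cannot read old data.
v3 = {"name": "temp_sensor_data", "fields": v1["fields"] + [
    {"name": "humidity", "type": "int"}]}

print(compatibility(v1, v2), compatibility(v1, v3), compatibility(v3, v1))
```

A real registry would also compare nested records and allow type promotions; this sketch only shows why the four levels are distinct.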

Example Schema Registry Schema
{ "name": "temp_sensor_data",
  "fields": [
    { "name": "sensorId", "type": "long"},
    { "name": "location", "type": "record",
      "fields": [
        { "name": "longitude", "type": "double"},
        { "name": "latitude", "type": "double"}
      ]},
    { "name": "temperature", "type": "int"},
    { "name": "readAt", "type": "long"}
  ]
}
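
To show how an application might use such a schema, here is a minimal sketch that checks a message against the temp_sensor_data schema above. The `validate` function and the mapping from schema type names to Python types are assumptions made for illustration; they are not part of the Schema Registry API.

```python
import json

SCHEMA = json.loads("""
{ "name": "temp_sensor_data",
  "fields": [
    { "name": "sensorId", "type": "long"},
    { "name": "location", "type": "record",
      "fields": [
        { "name": "longitude", "type": "double"},
        { "name": "latitude", "type": "double"}
      ]},
    { "name": "temperature", "type": "int"},
    { "name": "readAt", "type": "long"}
  ]
}
""")

# Assumed mapping from schema type names to Python types.
PY_TYPES = {"long": int, "int": int, "double": float, "string": str}


def validate(record: dict, fields: list, path: str = "") -> list:
    """Return a list of problems found in `record`; empty means valid."""
    errors = []
    for field in fields:
        name = path + field["name"]
        if field["name"] not in record:
            errors.append(f"{name}: missing")
            continue
        value = record[field["name"]]
        if field["type"] == "record":
            if isinstance(value, dict):
                errors.extend(validate(value, field["fields"], name + "."))
            else:
                errors.append(f"{name}: expected record")
        elif isinstance(value, bool) or not isinstance(value, PY_TYPES[field["type"]]):
            # bool is excluded explicitly because it is a subclass of int.
            errors.append(f"{name}: expected {field['type']}")
    return errors


good = {"sensorId": 17,
        "location": {"longitude": -122.4, "latitude": 37.8},
        "temperature": 21,
        "readAt": 1529496000}
bad = {"sensorId": 17,
       "location": {"longitude": -122.4},
       "temperature": "warm"}

print(validate(good, SCHEMA["fields"]))
print(validate(bad, SCHEMA["fields"]))
```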

Contrasting SQL and Registry Schemas
SQL | Schema Registry
Schema tied to a table | Schema tied to a Kafka topic or NiFi flow
Schema applies to all records in a partition | Records in a topic may have different versions of the schema, with no given order
Schema defined in SQL DDL, e.g. CREATE TABLE T (A INT, B VARCHAR(20)); | Schema defined in JSON
Primary access is via SQL for users and Thrift for SQL engines | Primary access is via UI for developers and Java/REST for streaming applications
Supports standard SQL types and Java types | Supports Java types
No concept of schema lifecycle | Schema lifecycle management via schema version state

Bringing the Strands Together

First Problem
With both the Hive Metastore and the Schema Registry we are adding yet another component to the system
• Administrators have another system to install, monitor, update, …
• Developers must maintain two systems whose basic functionality (recording and serving runtime metadata) is the same
• Other systems that want to integrate with runtime metadata (security systems like Ranger and Sentry, governance systems like Atlas) have to integrate with each component separately

Second Problem
Hardwiring a perspective into a metadata repository makes it harder to share data between applications
• Sometimes your streaming application will want to read from a table
– It would prefer to think of data in the registry model, whether it comes from a Hive table or a Kafka stream
• Sometimes your query will want to read from a stream
– It needs to think about data as being in a table, whether it comes from a Hive table or a Kafka stream
• To share data today, tools have to be able to read data using both paradigms

The Second Proposal: Cross the Streams
• Put the Schema Registry on top of the Metastore
• It will still support SQL and streaming perspectives
• One system means less work for admins, developers, and other tools
• One system with multiple perspectives means
– streaming tools can view data as a stream whether it is in Kafka or Hive
– batch tools can view data as a table whether it is in Hive or Kafka

Streaming Application Reading from a Table
Example:
• A stream userEvents
• Hive has a record of support calls, Kafka does not
• An application that flags users who have called support in the last 24 hours
• App can cache the table every hour and do a join as events arrive to flag users who need extra attention
• Possible today, but requires caching data in Kafka or coding the app to read both Hive and Kafka
• Because HMS (Hive Metastore) and SR (Schema Registry) are unified, streaming apps can view this as an SR schema
Kafka topic userEvents
Schema:
{ "group": "kafka",
  "fields": [{
    "userid": "long",
    "eventtype": "string",
    ...
  }]
}
Hive table support_calls
  userid long
  calltime timestamp
  summary string
supportCalls
Schema:
{ "group": "hive",
  "fields": [{
    "userid": "long",
    "calltime": "timestamp",
    "summary": "string"
  }]
}
Query Reading from a Stream
Example:
• Hive table user_events is loaded every hour from Kafka topic userEvents
• Today Hive streaming can quickly ingest data from Kafka, but will still be missing the last few seconds from Kafka
• Would like to be able to read the latest events from Kafka rather than wait until they load into Hive
• Because HMS (Hive Metastore) and SR (Schema Registry) are unified, Hive can view the Kafka topic as a partition of its table
• Hive queries can now read Kafka topic userEvents as a partition of user_events
Hive table user_events, partitioned by event_hour
  user_id long
  event_type varchar(256)
  event_hour datetime
Kafka topic userEvents
Schema:
{ "group": "kafka",
  "fields": [{
    "userid": "long",
    "eventtype": "string",
    ...
  }]
}
Hive table user_events, partition event_hour='latest'
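
A minimal sketch of the idea: a scan over user_events that reads the hourly partitions and then treats the Kafka topic as an extra 'latest' partition. Everything here is in memory; `FIELD_TO_COLUMN`, `kafka_as_partition`, and `scan_user_events` are hypothetical names, and the field-to-column mapping is an assumption based on the two schemas shown on the slide.

```python
from collections import Counter

# Assumed mapping from registry field names to Hive column names.
FIELD_TO_COLUMN = {"userid": "user_id", "eventtype": "event_type"}


def kafka_as_partition(records):
    """Present stream records as rows of partition event_hour='latest'."""
    rows = []
    for record in records:
        row = {FIELD_TO_COLUMN[k]: v for k, v in record.items() if k in FIELD_TO_COLUMN}
        row["event_hour"] = "latest"
        rows.append(row)
    return rows


def scan_user_events(hive_partitions, kafka_records):
    """Yield rows from the loaded hourly partitions, then from the stream."""
    for hour in sorted(hive_partitions):
        yield from hive_partitions[hour]
    yield from kafka_as_partition(kafka_records)


# Stand-in for partitions already loaded into Hive.
hive_partitions = {
    "2018-06-20 10": [
        {"user_id": 1, "event_type": "login", "event_hour": "2018-06-20 10"},
    ],
    "2018-06-20 11": [
        {"user_id": 2, "event_type": "login", "event_hour": "2018-06-20 11"},
        {"user_id": 1, "event_type": "purchase", "event_hour": "2018-06-20 11"},
    ],
}

# Stand-in for events still only in the Kafka topic userEvents.
kafka_records = [{"userid": 3, "eventtype": "login"}]

counts = Counter(row["event_type"] for row in scan_user_events(hive_partitions, kafka_records))
print(dict(counts))
```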

Some Assembly Required
• Need to bridge the gaps between SQL and Registry schemas (nontrivial)
– Schema consistent for all records in a partition versus different schema versions in the stream
– SQL types versus Java types
– Schema as an attribute of a table versus as a first class object with version and lifecycle
• Will require connectors so streaming apps can use batch serdes and vice versa
• Work in progress:
– https://github.com/apache/hive/pull/347
– https://issues.apache.org/jira/browse/HIVE-19521
– https://issues.apache.org/jira/browse/HIVE-19522
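
One of the gaps above, SQL types versus Java types, can be made concrete with a small sketch that turns a simple CREATE TABLE statement into a registry-style field list. The `SQL_TO_REGISTRY` mapping and the `table_to_schema` function are assumptions for illustration; they are not taken from the linked patches, and the parser handles only flat column lists.

```python
import re

# Assumed mapping from SQL type names to registry (Java-style) type names.
SQL_TO_REGISTRY = {
    "INT": "int",
    "BIGINT": "long",
    "DOUBLE": "double",
    "BOOLEAN": "boolean",
    "VARCHAR": "string",
    "STRING": "string",
}


def table_to_schema(ddl: str) -> dict:
    """Convert a flat CREATE TABLE statement into a registry-style schema."""
    match = re.fullmatch(
        r"\s*CREATE\s+TABLE\s+(\w+)\s*\((.*)\)\s*;?\s*",
        ddl,
        re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        raise ValueError("unsupported DDL")
    table, columns = match.groups()
    fields = []
    # Split on commas that are not inside a type's parentheses.
    for column in re.split(r",(?![^()]*\))", columns):
        name, sql_type = column.split(None, 1)
        base = re.sub(r"\(.*\)", "", sql_type).strip().upper()
        fields.append({"name": name, "type": SQL_TO_REGISTRY[base]})
    return {"name": table, "fields": fields}


schema = table_to_schema("CREATE TABLE T (A INT, B VARCHAR(20));")
print(schema)
```

Note that the conversion is lossy: the length limit in VARCHAR(20) has no counterpart in the target type, which is one reason the slide calls this bridging nontrivial.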

Can We Share Too Much?

Yes, Yes We Can
Example use case: Hive LLAP being used for analytics, Spark for ETL
[Diagram: LLAP and Spark sharing one Metastore]
I have been extolling the benefits of a shared Metastore for the last 20 slides, so clearly we want to share one instance between them
But,
• Hive and Spark can't always read each other's data
– e.g. Spark can't read Hive's ACID tables
• Different use cases require different security models
– e.g. Spark ETL is likely to use StorageBasedAuth, while LLAP is likely to use Ranger
• Different defaults are appropriate for different use cases
– e.g. doAs=false for LLAP, doAs=true for Hive reads from Spark catalog

Third Proposal: Add Catalogs
• A catalog is the standard SQL top-level container
• Catalogs contain databases, thus fully addressing a table will become catalog.database.table
• Default catalog 'hive' added in 3.0, and all existing databases placed in it
• In 3.0 catalogs exist only in the Metastore and are not yet exposed to SQL
• Goal: different catalogs can have different security settings and defaults
• Ongoing work, can be tracked at HIVE-18685
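
The catalog.database.table addressing can be sketched as a small name resolver. Since catalogs are not yet exposed to SQL in 3.0, this is purely illustrative: `resolve` is a hypothetical helper, and falling back to the 'hive' catalog and the 'default' database for shorter names is an assumption about how such resolution could work.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    database: str
    table: str


def resolve(name: str, default_catalog: str = "hive",
            default_database: str = "default") -> TableName:
    """Expand a one-, two-, or three-part name into catalog.database.table."""
    parts = name.split(".")
    if len(parts) == 3:
        return TableName(*parts)
    if len(parts) == 2:
        return TableName(default_catalog, *parts)
    if len(parts) == 1:
        return TableName(default_catalog, default_database, parts[0])
    raise ValueError(f"too many name parts: {name}")


print(resolve("etl.staging.user_events"))
print(resolve("sales.orders"))
print(resolve("support_calls"))
```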

Example Installation With Catalogs
[Diagram: LLAP and Spark sharing one Metastore]
LLAP defaults to 'hive' catalog
• Tables are ACID by default
• Ranger for security
• doAs=false
Spark defaults to 'etl' catalog
• ACID tables not allowed
• StorageBasedAuth for security
• doAs=true
Each can still read from the other catalog (assuming permission is granted), but can now be aware of changing authorization, defaults, etc.
Also useful in the cloud, where multiple business units may share storage but need different defaults, policies, etc.
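
The installation above can be modeled as per-catalog settings that each engine looks up. This is a sketch only: `CatalogSettings`, `settings_for`, and `can_create_table` are hypothetical names, not Metastore configuration keys, and the values simply restate the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CatalogSettings:
    acid_allowed: bool      # may ACID tables exist in this catalog?
    acid_by_default: bool   # are new tables ACID unless stated otherwise?
    authorizer: str         # security model for the catalog
    do_as: bool             # run as the end user (doAs=true) or not


CATALOGS = {
    "hive": CatalogSettings(acid_allowed=True, acid_by_default=True,
                            authorizer="Ranger", do_as=False),
    "etl": CatalogSettings(acid_allowed=False, acid_by_default=False,
                           authorizer="StorageBasedAuth", do_as=True),
}

ENGINE_DEFAULT_CATALOG = {"LLAP": "hive", "Spark": "etl"}


def settings_for(engine: str, catalog: Optional[str] = None) -> CatalogSettings:
    """Settings follow the catalog being accessed, not the engine itself."""
    return CATALOGS[catalog or ENGINE_DEFAULT_CATALOG[engine]]


def can_create_table(catalog: str, acid: bool) -> bool:
    """Reject ACID tables in catalogs that do not allow them."""
    return CATALOGS[catalog].acid_allowed or not acid


print(settings_for("Spark").authorizer)
print(settings_for("Spark", catalog="hive").authorizer)
print(can_create_table("etl", acid=True))
```

Keying the settings on the catalog rather than the engine is what lets Spark read from the 'hive' catalog while still being subject to that catalog's authorization and defaults.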

What Next?
• Now that we have released the Metastore as a separate module, the Hive community needs to decide whether it becomes a subproject or a separate top level project
• Need to finish the work to integrate the Schema Registry
• Need to involve contributors from other, non-Hive projects
• Need to finish implementing Catalogs
• Patches accepted!

Credits
• Apache Atlas, Apache Hadoop, Apache Hive, Apache Impala, Apache Kafka, Apache Pig, Apache Ranger, Apache Sentry, and Apache Spark are Apache Software Foundation projects
– All are referred to herein without “Apache” for brevity
• HDFS and MapReduce are components of Apache Hadoop
• Thanks to the Hive community for their work in getting the Hive Metastore separated out from much of the rest of Hive
• Google Translate used for Latin slide title

Thank You
