Sharing Metadata Across the Data Lake and Streams
Alan F. Gates
Co-founder Hortonworks,
Member Apache Hive PMC
June 2018
2 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Metadata in SQL
Big Data SQL Engines
There are many big data SQL engines: Hive, Spark, Impala, Presto, …
[Diagram: Hive, Impala, Presto, and Spark]
Pro Pluribus Unum
These engines all store their metadata in the Hive Metastore
[Diagram: the Hive Metastore shared by Hive, Impala, Presto, and Spark]
The Good, …
Good: Shared metadata makes sharing data between engines easier
The Bad, …
Bad: Non-Hive systems have to install much of Hive to get the Metastore
Bad: Hard for other projects to contribute to the Metastore
And the First Proposal
Proposal: Separate the Metastore from Hive
[Diagram: a standalone Metastore shared by Hive, Impala, Presto, and Spark]
Breaking out the Metastore
• Enables the Metastore to continue to be used by many engines
• In Hive 3.0 the Metastore was released as a separate module
• Can be installed and run without the rest of Hive
– A few features are missing when Hive is not present, e.g. the compactor
– Planning to add these in the future
• Backwards compatibility maintained for Thrift clients
– Older version clients can talk to the new, separate Metastore
• A few small changes for server hook implementations
• There is a proposal to make it a separate Apache project
– Will enable better collaboration with non-Hive projects
– Still in discussion with the Hive PMC on this
Enables Shared Metadata in the Cloud
[Diagram: Shared Data & Storage; On-Demand Ephemeral Workloads; Elastic Resource Management; Shared Metadata, Security & Governance]
Is this HCatalog 2.0?
• Didn’t we do this before? Wasn’t it called HCatalog?
• No, HCatalog is different
• HCatalog focuses on making the Metastore accessible by MapReduce, Pig, and other applications
– Includes metadata access
– Also includes data access (serdes, object inspectors, and input/output formats)
• The Metastore stores metadata, including which serdes etc. to use, but does not provide readers and writers
• HCatalog stays with Hive in this split; it does not go with the Metastore
– Because it includes the data access
Schemas in Streams
Example: Hortonworks Schema Registry
• Provides a central repository for messages’ metadata
• Intended for streaming data (e.g. Kafka) or edge data (e.g. NiFi)
• Can be used by any application via REST interface
• Schema defined in JSON
• Schema is tied to a Kafka topic or NiFi flow
• Every schema has a name, e.g. temp_sensor_data
• Schemas can have one or more versions
– Different messages in a topic may have different versions of the schema
– Compatibility between schema versions can be none, backwards, forwards, or both
• Lifecycle management: schema versions have state, e.g. INITIATED, ENABLED, ARCHIVED
• Serdes are stored with the schema so the system knows how to (de)serialize data
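The naming, versioning, and lifecycle bullets above can be pictured with a small in-memory model. This is only a sketch: the class and method names (NamedSchema, add_version, enabled_versions) are hypothetical and are not the Schema Registry's actual Java or REST API. Only the state names and compatibility levels come from the slide.

```python
from dataclasses import dataclass, field
from enum import Enum


class Compatibility(Enum):
    # Compatibility levels named on the slide
    NONE = "none"
    BACKWARD = "backward"
    FORWARD = "forward"
    BOTH = "both"


class State(Enum):
    # Lifecycle states named on the slide
    INITIATED = "INITIATED"
    ENABLED = "ENABLED"
    ARCHIVED = "ARCHIVED"


@dataclass
class SchemaVersion:
    version: int
    schema_text: str                 # the JSON schema definition
    state: State = State.INITIATED   # new versions start as INITIATED (assumption)


@dataclass
class NamedSchema:
    name: str
    compatibility: Compatibility
    versions: list = field(default_factory=list)

    def add_version(self, schema_text: str) -> SchemaVersion:
        """Append a new version, numbered from 1 (hypothetical numbering)."""
        v = SchemaVersion(version=len(self.versions) + 1, schema_text=schema_text)
        self.versions.append(v)
        return v

    def enabled_versions(self) -> list:
        """Return the version numbers currently in the ENABLED state."""
        return [v.version for v in self.versions if v.state is State.ENABLED]


schema = NamedSchema("temp_sensor_data", Compatibility.BACKWARD)
v1 = schema.add_version('{"name": "temp_sensor_data", "fields": []}')
v2 = schema.add_version(
    '{"name": "temp_sensor_data", "fields": [{"name": "sensorId", "type": "long"}]}'
)
v1.state = State.ARCHIVED
v2.state = State.ENABLED
print(schema.enabled_versions())
```

Because several versions can coexist under one name, a reader of the topic has to look up the version attached to each message rather than assume a single schema.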
Example Schema Registry Schema
{ "name": "temp_sensor_data",
  "fields": [
    { "name": "sensorId", "type": "long"},
    { "name": "location", "type": "record",
      "fields": [
        { "name": "longitude", "type": "double"},
        { "name": "latitude", "type": "double"}
      ]},
    { "name": "temperature", "type": "int"},
    { "name": "readAt", "type": "long"}
  ]
}
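As a rough illustration of how an application might consume such a schema, the sketch below parses the JSON above with Python's standard json module and flattens the nested location record into dotted field paths. The flatten helper is hypothetical and not part of any registry client.

```python
import json

# The example schema from the slide, reproduced as a string
SCHEMA_TEXT = """
{ "name": "temp_sensor_data",
  "fields": [
    { "name": "sensorId", "type": "long"},
    { "name": "location", "type": "record",
      "fields": [
        { "name": "longitude", "type": "double"},
        { "name": "latitude", "type": "double"}
      ]},
    { "name": "temperature", "type": "int"},
    { "name": "readAt", "type": "long"}
  ]
}
"""


def flatten(fields, prefix=""):
    """Yield (dotted_path, type) pairs, descending into nested records."""
    for f in fields:
        path = prefix + f["name"]
        if f["type"] == "record":
            yield from flatten(f["fields"], path + ".")
        else:
            yield path, f["type"]


schema = json.loads(SCHEMA_TEXT)
columns = dict(flatten(schema["fields"]))
for path, type_name in columns.items():
    print(path, type_name)
```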
Contrasting SQL and Registry Schemas
SQL | Schema Registry
Schema tied to a table | Schema tied to a Kafka topic or NiFi flow
Schema applies to all records in a partition | Records in a topic may have different versions of the schema, with no given order
Schema defined in SQL DDL: CREATE TABLE T (A INT, B VARCHAR(20)); | Schema defined in JSON
Primary access is via SQL for users and Thrift for SQL engines | Primary access is via UI for developers and Java/REST for streaming applications
Supports standard SQL types and Java types | Supports Java types
No concept of schema lifecycle | Schema lifecycle management via schema version state
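To make the contrast concrete, here is a minimal sketch that renders a registry-style JSON schema as a SQL DDL statement. The type mapping (long to BIGINT, record to STRUCT, and so on) is an assumption chosen for illustration; it is not taken from the talk or from the Hive patches, and it ignores versioning and lifecycle entirely.

```python
# Hypothetical mapping from registry (Java-style) types to Hive SQL types
TYPE_MAP = {"long": "BIGINT", "int": "INT", "double": "DOUBLE", "string": "STRING"}


def sql_type(field):
    """Return a SQL type for one field, turning nested records into STRUCTs."""
    if field["type"] == "record":
        inner = ", ".join(f"{c['name']}:{sql_type(c)}" for c in field["fields"])
        return f"STRUCT<{inner}>"
    return TYPE_MAP[field["type"]]


def to_ddl(schema):
    """Render a CREATE TABLE statement named after the schema."""
    cols = ", ".join(f"{f['name']} {sql_type(f)}" for f in schema["fields"])
    return f"CREATE TABLE {schema['name']} ({cols});"


example = {
    "name": "temp_sensor_data",
    "fields": [
        {"name": "sensorId", "type": "long"},
        {"name": "location", "type": "record",
         "fields": [
             {"name": "longitude", "type": "double"},
             {"name": "latitude", "type": "double"},
         ]},
    ],
}
ddl = to_ddl(example)
print(ddl)
```

The translation only works in one direction and for one schema version at a time, which hints at why the later slides call bridging the two models nontrivial.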
Bringing the Strands Together
First Problem
With both the Hive Metastore and the Schema Registry we are adding yet another component to the system
• Administrators have another system to install, monitor, update, …
• Developers must maintain two systems whose basic functionality, record & serve runtime metadata, is the same
• Other systems that want to integrate with runtime metadata, security systems like Ranger and Sentry and governance systems like Atlas, have to integrate with each component separately
Second Problem
• Sometimes your streaming application will want to read from a table
– It would prefer to think of data in the registry model, whether it comes from a Hive table or a Kafka stream
• Sometimes your query will want to read from a stream
– It needs to think about data as being in a table, whether it comes from a Hive table or a Kafka stream
• To share data today, tools have to be able to read data using both paradigms
Hardwiring a perspective into a metadata repository makes it harder to share data between applications
The Second Proposal: Cross the Streams
• Put the Schema Registry on top of the Metastore
• It will still support SQL and streaming perspectives
• One system means less work for admins, developers, and other tools
• One system with multiple perspectives means
– streaming tools can view data as a stream whether it is in Kafka or Hive
– batch tools can view data as a table whether it is in Hive or Kafka
Streaming Application Reading from a Table
Example:
• A stream userEvents
• Hive has a record of support calls, Kafka does not
• An application that flags users who have called support in the last 24 hours
Kafka topic userEvents
Schema:
{ "group": "kafka",
  "fields": [{
    "userid": "long",
    "eventtype": "string",
    ...
  }]
}
Hive table support_calls
  userid long
  calltime timestamp
  summary string
supportCalls
Schema:
{ "group": "hive",
  "fields": [{
    "userid": "long",
    "calltime": "timestamp",
    "summary": "string"
  }]
}
• App can cache the table every hour and do a join as events arrive to flag users who need extra attention
• Possible today, but requires caching data in Kafka or coding the app to read both Hive and Kafka
• Because HMS (Hive Metastore) and SR (Schema Registry) are unified, streaming apps can view this as an SR schema
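The cache-and-join idea on this slide can be sketched with plain Python data structures. Everything here is a stand-in: the lists imitate a snapshot of the Hive table support_calls and messages from the userEvents topic, and the helper names (build_cache, needs_attention) are hypothetical. No Hive or Kafka client is involved.

```python
from datetime import datetime, timedelta

# Stand-in for an hourly cached snapshot of Hive table support_calls
support_calls = [
    {"userid": 1, "calltime": datetime(2018, 6, 1, 9, 0), "summary": "login issue"},
    {"userid": 2, "calltime": datetime(2018, 5, 28, 9, 0), "summary": "billing"},
]


def build_cache(rows):
    """Map each userid to the time of that user's most recent support call."""
    cache = {}
    for row in rows:
        previous = cache.get(row["userid"])
        if previous is None or row["calltime"] > previous:
            cache[row["userid"]] = row["calltime"]
    return cache


def needs_attention(event, cache, now, window=timedelta(hours=24)):
    """True if the event's user called support within the window before now."""
    last_call = cache.get(event["userid"])
    return last_call is not None and now - last_call <= window


cache = build_cache(support_calls)
now = datetime(2018, 6, 1, 12, 0)

# Stand-in for messages arriving on the userEvents stream
events = [
    {"userid": 1, "eventtype": "click"},
    {"userid": 2, "eventtype": "click"},
    {"userid": 3, "eventtype": "view"},
]
flagged = [e["userid"] for e in events if needs_attention(e, cache, now)]
print(flagged)
```

In this sketch only user 1 is flagged: user 2's call is older than 24 hours and user 3 has no call on record.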
Query Reading from a Stream
Example:
• Hive table user_events is loaded every hour from Kafka topic userEvents
• Today Hive streaming can quickly ingest data from Kafka, but will still be missing the last few seconds from Kafka
• Would like to be able to read the latest events from Kafka rather than wait until they load into Hive
Hive table user_events, partitioned by event_hour
  user_id long
  event_type varchar(256)
  event_hour datetime
Kafka topic userEvents
Schema:
{ "group": "kafka",
  "fields": [{
    "userid": "long",
    "eventtype": "string",
    ...
  }]
}
• Because HMS and SR are unified, Hive can view the Kafka topic as a partition of its table
Hive table user_events, partition event_hour='latest'
• Hive queries can now read Kafka topic userEvents as a partition of user_events
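One way to picture the 'latest' partition is as a scan that reads the hourly partitions already loaded into Hive and then falls through to the topic for anything newer. The sketch below uses in-memory stand-ins only; the function names and the field renaming from userid/eventtype to user_id/event_type are assumptions made for illustration, not the design in the Hive patches.

```python
# Stand-in for hourly partitions of user_events already loaded into Hive
loaded_partitions = {
    "2018-06-01-10": [{"user_id": 1, "event_type": "click"}],
    "2018-06-01-11": [{"user_id": 2, "event_type": "view"}],
}

# Stand-in for messages still only in Kafka topic userEvents
topic_user_events = [{"userid": 3, "eventtype": "click"}]


def from_topic(message):
    """Rename registry-style fields to the table's column names (assumed mapping)."""
    return {"user_id": message["userid"], "event_type": message["eventtype"]}


def read_partition(event_hour):
    """Read one partition; 'latest' is served from the topic instead of storage."""
    if event_hour == "latest":
        return [from_topic(m) for m in topic_user_events]
    return loaded_partitions.get(event_hour, [])


def scan_user_events():
    """Scan all loaded partitions in order, then the 'latest' partition."""
    rows = []
    for hour in sorted(loaded_partitions):
        rows.extend(read_partition(hour))
    rows.extend(read_partition("latest"))
    return rows


user_ids = [row["user_id"] for row in scan_user_events()]
print(user_ids)
```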
Some Assembly Required
• Need to bridge the gaps between SQL and Registry schemas (nontrivial)
– Schema consistent for all records in a partition versus different schema versions in the stream
– SQL types versus Java types
– Schema as an attribute of a table versus as a first-class object with version and lifecycle
• Will require connectors so streaming apps can use batch serdes and vice versa
• Work in progress:
– https://github.com/apache/hive/pull/347
– https://github.com/apache/hive/pull/348
– https://github.com/apache/hive/pull/349
– https://issues.apache.org/jira/browse/HIVE-19521
– https://issues.apache.org/jira/browse/HIVE-19522
Can We Share Too Much?
Yes, Yes We Can
Example use case: Hive LLAP being used for analytics, Spark for ETL
[Diagram: LLAP and Spark sharing one Metastore]
I have been extolling the benefits of a shared Metastore for the last 20 slides, so clearly we want to share one instance between them
But,
• Hive and Spark can't always read each other's data
– e.g. Spark can't read Hive's ACID tables
• Different use cases require different security models
– e.g. Spark ETL is likely to use StorageBasedAuth, while LLAP is likely to use Ranger
• Different defaults are appropriate for different use cases
– e.g. doAs=false for LLAP, doAs=true for Hive reads from Spark catalog
Third Proposal: Add Catalogs
• Catalog is the standard SQL top-level container
• Catalogs contain databases, thus fully addressing a table will become catalog.database.table
• Default catalog 'hive' added in 3.0, and all existing databases placed in it
• In 3.0 catalogs exist only in the Metastore and are not yet exposed to SQL
• Goal: different catalogs can have different security settings and defaults
• Ongoing work, can be tracked at HIVE-18685
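The catalog.database.table addressing described above could be resolved along these lines. This is a sketch, not Hive's parser: the resolve function and TableName type are hypothetical, the default catalog 'hive' comes from the slide, and 'default' is Hive's usual default database name.

```python
from typing import NamedTuple

DEFAULT_CATALOG = "hive"      # default catalog added in 3.0, per the slide
DEFAULT_DATABASE = "default"  # Hive's default database name


class TableName(NamedTuple):
    catalog: str
    database: str
    table: str


def resolve(name, catalog=DEFAULT_CATALOG, database=DEFAULT_DATABASE):
    """Expand a 1-, 2-, or 3-part name into catalog.database.table."""
    parts = name.split(".")
    if len(parts) == 1:
        return TableName(catalog, database, parts[0])
    if len(parts) == 2:
        return TableName(catalog, parts[0], parts[1])
    if len(parts) == 3:
        return TableName(*parts)
    raise ValueError(f"too many name parts: {name!r}")


print(resolve("support_calls"))
print(resolve("sales.orders"))
print(resolve("etl.staging.events"))
```

Existing one- and two-part names keep resolving as before because everything that predates catalogs lands in the 'hive' catalog.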
Example Installation With Catalogs
[Diagram: LLAP and Spark sharing one Metastore]
LLAP defaults to 'hive' catalog
• Tables are ACID by default
• Ranger for security
• doAs=false
Spark defaults to 'etl' catalog
• ACID tables not allowed
• StorageBasedAuth for security
• doAs=true
Each can still read from the other catalog (assuming permission is granted), but can now be aware of changing authorization, defaults, etc.
Also useful in the cloud, where multiple business units may share storage but need different defaults, policies, etc.
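The example installation can be written down as per-catalog settings that an engine looks up when it touches a catalog. This is a sketch under assumptions: the CatalogSettings fields and the settings_for helper are hypothetical names, and the rule that settings follow the catalog being accessed is one reading of the slide, not a documented Metastore behavior. The values themselves mirror the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogSettings:
    acid_allowed: bool
    acid_by_default: bool
    authorizer: str
    do_as: bool


# Values taken from the example installation on the slide
CATALOGS = {
    "hive": CatalogSettings(acid_allowed=True, acid_by_default=True,
                            authorizer="Ranger", do_as=False),
    "etl": CatalogSettings(acid_allowed=False, acid_by_default=False,
                           authorizer="StorageBasedAuth", do_as=True),
}

# Which catalog each engine uses when none is named
ENGINE_DEFAULT_CATALOG = {"LLAP": "hive", "Spark": "etl"}


def settings_for(engine, catalog=None):
    """Look up settings by the catalog accessed, falling back to the engine default."""
    return CATALOGS[catalog or ENGINE_DEFAULT_CATALOG[engine]]


print(settings_for("Spark"))
print(settings_for("Spark", "hive"))
```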
What Next?
• Now that we have released the Metastore as a separate module, the Hive community needs to decide whether it becomes a subproject or a separate top-level project
• Need to finish the work to integrate the Schema Registry
• Need to involve contributors from other, non-Hive projects
• Need to finish implementing Catalogs
• Patches accepted!
Credits
• Apache Atlas, Apache Hadoop, Apache Hive, Apache Impala, Apache Kafka, Apache Pig, Apache Ranger, Apache Sentry, and Apache Spark are Apache Software Foundation projects
– All are referred to herein without “Apache” for brevity
• HDFS and MapReduce are components of Apache Hadoop
• Thanks to the Hive community for their work in getting the Hive Metastore separated out from much of the rest of Hive
• Google Translate used for Latin slide title
Thank You

Contenu connexe

Tendances

Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
DataWorks Summit
 
Data in the Cloud Crash Course
Data in the Cloud Crash CourseData in the Cloud Crash Course
Data in the Cloud Crash Course
DataWorks Summit
 
Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...
DataWorks Summit
 

Tendances (20)

LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
What's new in Ambari
What's new in AmbariWhat's new in Ambari
What's new in Ambari
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
 
Analyzing the World's Largest Security Data Lake!
Analyzing the World's Largest Security Data Lake!Analyzing the World's Largest Security Data Lake!
Analyzing the World's Largest Security Data Lake!
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Ingesting Data at Blazing Speed Using Apache Orc
Ingesting Data at Blazing Speed Using Apache OrcIngesting Data at Blazing Speed Using Apache Orc
Ingesting Data at Blazing Speed Using Apache Orc
 
Format Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and ParquetFormat Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and Parquet
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Data in the Cloud Crash Course
Data in the Cloud Crash CourseData in the Cloud Crash Course
Data in the Cloud Crash Course
 
Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...
 
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersProtecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against Disasters
 
Accelerating Big Data Insights
Accelerating Big Data InsightsAccelerating Big Data Insights
Accelerating Big Data Insights
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data Warehouse
 
Quality for the Hadoop Zoo
Quality for the Hadoop ZooQuality for the Hadoop Zoo
Quality for the Hadoop Zoo
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
 
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
 
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouse
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
 

Similaire à Sharing metadata across the data lake and streams

Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
DataWorks Summit
 

Similaire à Sharing metadata across the data lake and streams (20)

Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
 
Schema Registry & Stream Analytics Manager
Schema Registry  & Stream Analytics ManagerSchema Registry  & Stream Analytics Manager
Schema Registry & Stream Analytics Manager
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Cloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsCloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerations
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
Curing the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerCuring the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging Manager
 
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San JoseCloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep Dive
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?
 
Classification based security in Hadoop
Classification based security in HadoopClassification based security in Hadoop
Classification based security in Hadoop
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Data Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat AlwellData Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat Alwell
 

Plus de DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Dernier (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Sharing metadata across the data lake and streams

  • 1. Sharing Metadata Across the Data Lake and Streams. Alan F. Gates, Co-founder Hortonworks, Member Apache Hive PMC. June 2018
  • 2. 2 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
  • 3. Metadata in SQL
  • 4. Big Data SQL Engines
      - There are many big data SQL engines: Hive, Spark, Impala, Presto, …
      - [Diagram: Hive, Impala, Presto, Spark]
  • 5. Pro Pluribus Unum
      - There are many big data SQL engines: Hive, Spark, Impala, Presto, …
      - These engines all store their metadata in the Hive Metastore
      - [Diagram: Hive Metastore; Hive, Impala, Presto, Spark]
  • 6. The Good, …
      - There are many big data SQL engines: Hive, Spark, Impala, Presto, …
      - These engines all store their metadata in the Hive Metastore
      - Good: Shared metadata makes sharing data between engines easier
      - [Diagram: Hive Metastore; Hive, Impala, Presto, Spark]
  • 7. The Bad, …
      - There are many big data SQL engines: Hive, Spark, Impala, Presto, …
      - These engines all store their metadata in the Hive Metastore
      - Good: Shared metadata makes sharing data between engines easier
      - Bad: Non-Hive systems have to install much of Hive to get the Metastore
      - Bad: Hard for other projects to contribute to the Metastore
      - [Diagram: Hive Metastore; Hive, Impala, Presto, Spark]
  • 8. And the First Proposal
      - There are many big data SQL engines: Hive, Spark, Impala, Presto, …
      - These engines all store their metadata in the Hive Metastore
      - Good: Shared metadata makes sharing data between engines easier
      - Bad: Non-Hive systems have to install much of Hive to get the Metastore
      - Bad: Hard for other projects to contribute to the Metastore
      - Proposal: Separate the Metastore from Hive
      - [Diagram: Metastore; Hive, Impala, Presto, Spark]
  • 9. Breaking out the Metastore
      - Enables the Metastore to continue to be used by many engines
      - In Hive 3.0 the Metastore was released as a separate module
      - Can be installed and run without the rest of Hive
          - A few features missing when Hive not present: e.g. the compactor
          - Planning to add these in the future
      - Backwards compatibility maintained for Thrift clients
          - Older version clients can talk to the new, separate, metastore
      - A few small changes for server hook implementations
      - There is a proposal to make it a separate Apache project
          - Will enable better collaboration with non-Hive projects
          - Still in discussion with the Hive PMC on this
  • 10. Enables Shared Metadata in the Cloud
      - [Diagram labels: Shared Data & Storage; On-Demand Ephemeral Workloads; Elastic Resource Management; Shared Metadata, Security & Governance]
  • 11. Is this HCatalog 2.0?
      - Didn’t we do this before? Wasn’t it called HCatalog?
      - No, HCatalog is different
      - HCatalog focuses on making the Metastore accessible by MapReduce, Pig, and other applications
          - Includes metadata access
          - Also includes data access (serdes, object inspectors, and input/output formats)
      - Metastore stores metadata, including which serdes etc. to use, but does not provide readers and writers
      - HCatalog stays with Hive in this split; it does not go with the Metastore
          - Because it includes the data access
  • 12. Schemas in Streams
  • 13. Example: Hortonworks Schema Registry
      - Provides a central repository for messages’ metadata
      - Intended for streaming data (e.g. Kafka) or edge data (e.g. NiFi)
      - Can be used by any application via REST interface
      - Schema defined in JSON
      - Schema is tied to a Kafka topic or NiFi flow
      - Every schema has a name: e.g. temp_sensor_data
      - Schemas can have one or more versions
          - Different messages in a topic will have different versions of the schema
          - Compatibility between schema versions can be none, backwards, forwards, or both
      - Lifecycle management: schema versions have state, e.g. INITIATED, ENABLED, ARCHIVED
      - Serdes stored with schema so system knows how to (de)serialize data
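The compatibility modes named on this slide (none, backwards, forwards, both) can be illustrated with a small sketch. The rule used here is an assumption modeled on Avro-style schema resolution, not the Schema Registry's actual algorithm: a reader schema can decode a writer schema's records when shared fields keep their type and any field known only to the reader has a default. The function names and the `default` key are hypothetical.

```python
def fields_by_name(schema):
    """Index a schema's top-level fields by name."""
    return {f["name"]: f for f in schema["fields"]}


def can_read(reader, writer):
    """True if `reader` can decode records written with `writer`.

    Assumed rule (Avro-like, not taken from the slides): shared fields must
    keep their type, and fields only the reader knows must have a default.
    """
    written = fields_by_name(writer)
    for name, field in fields_by_name(reader).items():
        if name in written:
            if written[name]["type"] != field["type"]:
                return False
        elif "default" not in field:
            return False
    return True


def compatibility(old, new):
    """Classify `new` against `old` as none, backwards, forwards, or both."""
    backwards = can_read(new, old)  # new readers, old data
    forwards = can_read(old, new)   # old readers, new data
    if backwards and forwards:
        return "both"
    if backwards:
        return "backwards"
    if forwards:
        return "forwards"
    return "none"


v1 = {"name": "temp_sensor_data", "fields": [
    {"name": "sensorId", "type": "long"},
    {"name": "temperature", "type": "int"},
]}
# v2 adds a field that carries a default value.
v2 = {"name": "temp_sensor_data", "fields": v1["fields"] + [
    {"name": "unit", "type": "string", "default": "C"},
]}
# v3 adds a field without a default.
v3 = {"name": "temp_sensor_data", "fields": v1["fields"] + [
    {"name": "readAt", "type": "long"},
]}
# v4 changes the type of an existing field.
v4 = {"name": "temp_sensor_data", "fields": [
    {"name": "sensorId", "type": "long"},
    {"name": "temperature", "type": "double"},
]}

print(compatibility(v1, v2))
print(compatibility(v1, v3))
print(compatibility(v1, v4))
```

Under these assumed rules, adding a defaulted field keeps both directions readable, adding a field without a default breaks new readers on old data, and changing a field's type breaks both directions.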
  • 14. Example Schema Registry Schema

      {
        "name": "temp_sensor_data",
        "fields": [
          { "name": "sensorId", "type": "long" },
          { "name": "location", "type": "record", "fields": [
            { "name": "longitude", "type": "double" },
            { "name": "latitude", "type": "double" }
          ]},
          { "name": "temperature", "type": "int" },
          { "name": "readAt", "type": "long" }
        ]
      }
  • 15. Contrasting SQL and Registry Schemas

      SQL | Schema Registry
      Schema tied to a table | Schema tied to a Kafka topic or NiFi flow
      Schema applies to all records in a partition | Records in a topic may have different versions of the schema, with no given order
      Schema defined in SQL DDL, e.g. CREATE TABLE T (A INT, B VARCHAR(20)); | Schema defined in JSON
      Primary access is via SQL for users and Thrift for SQL engines | Primary access is via UI for developers and Java/REST for streaming applications
      Supports standard SQL types and Java types | Supports Java types
      No concept of schema lifecycle | Schema lifecycle management via schema version state
  • 16. Bringing the Strands Together
  • 17. First Problem
      - With both the Hive Metastore and the Schema Registry we are adding yet another component to the system
      - Administrators have another system to install, monitor, update, …
      - Developers must maintain two systems whose basic functionality, record & serve runtime metadata, is the same
      - Other systems that want to integrate with runtime metadata, security systems like Ranger and Sentry and governance systems like Atlas, have to integrate with each component separately
  • 18. Second Problem
      - Hardwiring a perspective into a metadata repository makes it harder to share data between applications
      - Sometimes your streaming application will want to read from a table
          - It would prefer to think of data in the registry model, whether it comes from a Hive table or a Kafka stream
      - Sometimes your query will want to read from a stream
          - It needs to think about data as being in a table, whether it comes from a Hive table or a Kafka stream
      - To share data today, tools have to be able to read data using both paradigms
  • 19. The Second Proposal: Cross the Streams
      - Put the Schema Registry on top of the Metastore
      - It will still support SQL and streaming perspectives
      - One system means less work for admins, developers, and other tools
      - One system with multiple perspectives means
          - streaming tools can view data as a stream whether it is in Kafka or Hive
          - batch tools can view data as a table whether it is in Hive or Kafka
  • 20. Streaming Application Reading from a Table
      - Example:
          - A stream userEvents
          - Hive has record of support calls, Kafka does not
          - An application that flags users who have called support in the last 24 hours
          - App can cache table every hour, do a join as events arrive to flag users who need extra attention
      - Possible today, but requires caching data in Kafka or coding app to read both Hive and Kafka
      - Because HMS and SR are unified, streaming apps can view this as an SR Schema
      - Kafka topic userEvents, schema: { "group": "kafka", "fields": [{ "userid": "long", "eventtype": "string", ... }] }
      - Hive table support_calls: userid long, calltime timestamp, summary string
      - supportCalls schema: { "group": "hive", "fields": [{ "userid": "long", "calltime": "timestamp", "summary": "string" }] }
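The "cache the table every hour, join as events arrive" pattern from this slide can be sketched with in-memory stand-ins. This is only an illustration under stated assumptions: the two lists below replace the Hive table support_calls and the Kafka topic userEvents, no real Hive or Kafka client is used, and the function names and the `needs_attention` key are hypothetical.

```python
from datetime import datetime, timedelta


def recent_callers(support_calls, now, window=timedelta(hours=24)):
    """Return the user ids with a support call inside the window ending at `now`."""
    cutoff = now - window
    return {row["userid"] for row in support_calls
            if cutoff <= row["calltime"] <= now}


def flag_events(events, callers):
    """Join each incoming event against the cached set of recent callers."""
    for event in events:
        yield {**event, "needs_attention": event["userid"] in callers}


now = datetime(2018, 6, 20, 12, 0)

# Stand-in for the hourly snapshot of the Hive table support_calls.
support_calls = [
    {"userid": 1, "calltime": now - timedelta(hours=3), "summary": "login issue"},
    {"userid": 2, "calltime": now - timedelta(days=2), "summary": "billing"},
]

# Stand-in for messages arriving on the Kafka topic userEvents.
events = [
    {"userid": 1, "eventtype": "page_view"},
    {"userid": 2, "eventtype": "purchase"},
]

cache = recent_callers(support_calls, now)  # refreshed once per hour in practice
flagged = list(flag_events(events, cache))
for event in flagged:
    print(event)
```

Reducing the cached table to a set of user ids keeps the per-event lookup constant time, which matters when the join runs on every message.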
  • 21. Query Reading from a Stream
      - Example:
          - Hive table user_events is loaded every hour from Kafka topic userEvents
          - Today Hive streaming can quickly ingest data from Kafka, but will still be missing the last few seconds from Kafka
          - Would like to be able to read latest events from Kafka rather than wait until it loads into Hive
      - Because HMS and SR are unified, Hive can view Kafka topic as partition of its table
      - Hive queries can now read Kafka topic userEvents as a partition of user_events
      - Hive table user_events, partitioned by event_hour: user_id long, event_type varchar(256), event_hour datetime
      - Kafka topic userEvents, schema: { "group": "kafka", "fields": [{ "userid": "long", "eventtype": "string", ... }] }
      - Hive table user_events, partition event_hour='latest'
  • 22. Some Assembly Required
      - Need to bridge the gaps between SQL and Registry schemas. Nontrivial:
          - Schema consistent for all records in a partition versus different schema versions in the stream
          - SQL types versus Java types
          - Schema as an attribute of a table versus as a first class object with version and lifecycle
      - Will require connectors so streaming apps can use batch serdes and vice versa
      - Work in progress:
          - https://github.com/apache/hive/pull/347
          - https://github.com/apache/hive/pull/348
          - https://github.com/apache/hive/pull/349
          - https://issues.apache.org/jira/browse/HIVE-19521
          - https://issues.apache.org/jira/browse/HIVE-19522
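One of the gaps listed here, SQL types versus Java types, can be made concrete with a small sketch that renders a table's columns as a registry-style schema. The type mapping below is purely hypothetical and is not taken from the linked pull requests; the names `SQL_TO_REGISTRY`, `registry_type`, and `table_to_schema` are invented for illustration, and unmapped types raise an error rather than guess.

```python
import re

# Hypothetical mapping from SQL column types to registry-style type names.
SQL_TO_REGISTRY = {
    "tinyint": "int",
    "smallint": "int",
    "int": "int",
    "bigint": "long",
    "float": "float",
    "double": "double",
    "boolean": "boolean",
    "string": "string",
    "varchar": "string",
    "char": "string",
}


def registry_type(sql_type):
    """Map a SQL type such as 'VARCHAR(20)' to a registry type name."""
    # Drop any length or precision suffix, e.g. "(20)".
    base = re.sub(r"\(.*\)$", "", sql_type.strip().lower())
    if base not in SQL_TO_REGISTRY:
        raise ValueError(f"no registry mapping assumed for SQL type {sql_type!r}")
    return SQL_TO_REGISTRY[base]


def table_to_schema(table_name, columns):
    """Render (column name, SQL type) pairs as a registry-style schema dict."""
    return {
        "name": table_name,
        "fields": [{"name": name, "type": registry_type(sql_type)}
                   for name, sql_type in columns],
    }


# Columns of the DDL example: CREATE TABLE T (A INT, B VARCHAR(20));
schema = table_to_schema("T", [("A", "INT"), ("B", "VARCHAR(20)")])
print(schema)
```

Note that the mapping is lossy: the length limit on VARCHAR(20) has no counterpart in the target schema, which is one reason the slide calls this bridging nontrivial.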
  • 23. Can We Share Too Much?
  • 24. Yes, Yes We Can
      - Example use case: Hive LLAP being used for analytics, Spark for ETL
      - [Diagram: Metastore; LLAP, Spark]
      - I have been extolling the benefits of a shared Metastore for the last 20 slides, so clearly we want to share one instance between them
      - But:
          - Hive and Spark can't always read each other's data, e.g. Spark can't read Hive's ACID tables
          - Different use cases require different security models, e.g. Spark ETL is likely to use StorageBasedAuth, while LLAP is likely to use Ranger
          - Different defaults are appropriate for different use cases, e.g. doAs=false for LLAP, doAs=true for Hive reads from Spark catalog
  • 25. Third Proposal: Add Catalogs
      - Catalog is standard SQL top level container
      - Catalogs contain databases, thus fully addressing a table will become catalog.database.table
      - Default catalog 'hive' added in 3.0, and all existing databases placed in it
      - In 3.0 only exists in metastore, not yet exposed to SQL
      - Goal: different catalogs can have different security settings and defaults
      - Ongoing work, can be tracked at HIVE-18685
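The catalog.database.table addressing described on this slide can be sketched as a name-resolution helper. This is an illustration only: the slide notes that catalogs are not yet exposed to SQL in 3.0, so the `qualify` function and its resolution rules are hypothetical. The default catalog 'hive' comes from the slide; 'default' is Hive's standard default database name.

```python
def qualify(name, default_catalog="hive", default_database="default"):
    """Expand a 1-, 2-, or 3-part table name to (catalog, database, table)."""
    parts = name.split(".")
    if len(parts) > 3 or any(not part for part in parts):
        raise ValueError(f"not a valid table name: {name!r}")
    if len(parts) == 1:
        # Bare table name: fill in both defaults.
        return (default_catalog, default_database, parts[0])
    if len(parts) == 2:
        # database.table: fill in the default catalog only.
        return (default_catalog, parts[0], parts[1])
    return tuple(parts)


print(qualify("support_calls"))
print(qualify("sales.user_events"))
print(qualify("etl.staging.user_events"))
print(qualify("user_events", default_catalog="etl"))
```

Passing a different `default_catalog` per engine mirrors the stated goal that different catalogs can carry different security settings and defaults.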
  • 26. Example Installation With Catalogs
      - [Diagram: Metastore; LLAP, Spark]
      - LLAP defaults to 'hive' catalog
          - Tables are ACID by default
          - Ranger for security
          - doAs=false
      - Spark defaults to 'etl' catalog
          - ACID tables not allowed
          - StorageBasedAuth for security
          - doAs=true
      - Each can still read from the other catalog (assuming permission granted), but can now be aware of changing authorization, defaults, etc.
      - Also useful in the cloud, where multiple business units may share storage but need different defaults, policies, etc.
  • 27. What Next?
      - Now that we have released the Metastore as a separate module, the Hive community needs to decide whether it becomes a subproject or a separate top level project
      - Need to finish the work to integrate the Schema Registry
      - Need to involve contributors from other, non-Hive projects
      - Need to finish implementing Catalogs
      - Patches accepted!
  • 28. Credits
      - Apache Atlas, Apache Hadoop, Apache Hive, Apache Impala, Apache Kafka, Apache Pig, Apache Ranger, Apache Sentry, and Apache Spark are Apache Software Foundation projects
          - All are referred to herein without “Apache” for brevity
      - HDFS and MapReduce are components of Apache Hadoop
      - Thanks to the Hive community for their work in getting the Hive Metastore separated out from much of the rest of Hive
      - Google Translate used for Latin slide title
  • 29. Thank You