- 1. © Hortonworks Inc. 2011 – 2017. All Rights Reserved
HiveWarehouseConnector (Update)
Eric Wohlstadter – Hortonworks R&D
June 2018
- 2.
Overview
HDP3 version of Spark-Hive Connector
Features
– Spark access to ACID tables
– Other integrations
• e.g. Spark access to Ranger tables
API and Architecture
– Reads from Hive to Spark
– Writes from Spark to Hive
- 3.
Features: Spark access to ACID tables
Hive supports traditional ACID semantics
– ORC with delta files to support low-latency writes
– Compaction to prevent storage fragmentation
– Custom readers to reconcile deltas on read
ACID tables use extended Metastore format
Spark doesn’t read/write ACID tables
Spark doesn’t use ACID Metastore format
Goal: Support Spark reads/writes for ACID tables
- 4.
[Diagram: Spark Driver/Executors with a Spark Metastore; HiveServer+Tez and LLAP Daemons with a Hive Metastore; the paths from Spark to the ACID tables are crossed out]
Spark can’t read/write ACID tables
Spark doesn’t use ACID Metastore format
- 5.
[Diagram: HWC bridges the Spark Driver/Executors to HiveServer+Tez and the LLAP Daemons; Spark and Hive keep separate Metastores]
Isolate Spark and Hive Catalogs/Tables
Leverage connector for Spark <-> Hive
- 6.
Features: Spark access to Ranger tables
– Column-level access control
– Column masking
• “show only first four chars of string column”
– Row-level access control
• “show only rows WHERE …”
Support Spark reads/writes for Ranger tables
[Screenshot: Ranger UI]
- 7.
Overview
Latest version of Hive Connector library for Spark
Features
– Spark access to Ranger tables
– Spark access to ACID tables
API and Architecture
– Reads from Hive to Spark
– Writes from Spark to Hive
- 8.
“JDBC-like” READ API
a) hive = HiveWarehouseBuilder.session(spark).build()
• Create HiveWarehouseSession
b) hive.execute(sql: String): DataFrame
• SHOW, DESCRIBE, etc.
c) hive.executeQuery(sql: String): DataFrame
• SELECT, SELECT CTE, etc.
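Taken together, a typical read flow with the API above might look like the following sketch. The `HiveWarehouseBuilder` import path and the table/column names are assumptions; running this requires a cluster with the connector on the Spark classpath and the HiveServer2 JDBC URL configured, so it is illustrative rather than standalone-runnable.

```scala
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder

// a) Build a HiveWarehouseSession from an existing SparkSession
val hive = HiveWarehouseBuilder.session(spark).build()

// b) Catalog operations (SHOW, DESCRIBE, ...) run over JDBC;
//    the ResultSet comes back as a DataFrame
hive.execute("SHOW DATABASES").show()

// c) Queries (SELECT, CTEs, ...) are compiled by HiveServer and
//    executed on LLAP; results arrive as a DataFrame that can be
//    transformed further with the normal DataFrame API
val df = hive.executeQuery("SELECT * FROM t")
df.sort("A").show()
```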
- 9.
[Diagram: Spark Driver connects to HiveServer via HWC over Thrift JDBC; Spark and Hive Metastores stay separate; Executors and LLAP Daemons are not involved]
b) hive.execute(“show databases”).show()
• Driver submits catalog op to HiveServer
• HWC returns the ResultSet as a DataFrame
- 10.
[Diagram: Spark Driver connects to HiveServer via HWC over JDBC; LLAP Daemons read the ACID tables]
c) hive.executeQuery(“SELECT * FROM t”).sort(“A”).show()
1. Driver submits query to HiveServer
2. HiveServer compiles the query and returns “splits” to the Driver
3. Query executes on LLAP
- 11.
[Diagram: Spark Executors fetch Arrow data via HWC from the LLAP Daemons, which read the ACID tables]
c) hive.executeQuery(“SELECT * FROM t”).sort(“A”).show()
4. Executor tasks run for each split
5. Tasks read Arrow data from LLAP
6. HWC returns ArrowColumnVectors to Spark
- 12.
Other Recent READ improvements
Leverage Spark 2.3.1 support for Arrow
Implemented SupportsColumnBatchScan plugin
Add Hive Arrow SerDe
Add Arrow support to LlapOutputFormatService
- 13.
Overview
Latest version of Hive Connector library for Spark
Features
– Spark access to Ranger tables
– Spark access to ACID tables
API and Architecture
– Reads from Hive to Spark
– Writes from Spark to Hive
- 14.
Connector WRITE API
a) hive.executeUpdate(sql: String): Boolean
• CREATE, UPDATE, ALTER, INSERT, MERGE, DELETE, etc.
b) df.write.format(HIVE_WAREHOUSE_CONNECTOR)
• Write DataFrame using LOAD DATA INTO TABLE
c) df.writeStream.format(STREAM_TO_STREAM)
• Write Streaming DataFrame using Hive-Streaming
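As a sketch, the three write paths above could be exercised as follows. This assumes an existing `hive` session plus DataFrames `df` and `streamingDf`, and that the `HIVE_WAREHOUSE_CONNECTOR` and `STREAM_TO_STREAM` format constants are imported from the connector; like the read examples, it needs a configured cluster to actually run.

```scala
// a) DDL/DML statements go through HiveServer; true is returned on success
val ok: Boolean = hive.executeUpdate("INSERT INTO s SELECT * FROM t")

// b) Batch write: executor tasks stage ORC files, then the driver
//    issues LOAD DATA INTO TABLE on commit
df.write.format(HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "my_acid_table")
  .save()

// c) Streaming write: tasks open Hive-Streaming transactions and
//    write rows directly into the ACID table
streamingDf.writeStream.format(STREAM_TO_STREAM)
  .option("table", "my_acid_table")
  .start()
```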
- 15.
[Diagram: Spark Driver submits to HiveServer2 via HWC over Thrift JDBC; Tez and LLAP process the update]
a) hive.executeUpdate(“INSERT INTO s SELECT * FROM t”)
1. Driver submits update op to HiveServer2
2. Update is processed through Tez and/or LLAP
3. HWC returns true on success
- 16.
Example: LOAD to Hive
df.select("ws_sold_time_sk", "ws_ship_date_sk")
  .filter("ws_sold_time_sk > 80000")
  .write.format(HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "my_acid_table")
  .save()
- 17.
[Diagram: Spark Executors write ORC files to HDFS /tmp; the Driver then issues LOAD DATA INTO TABLE to move them into the ACID tables]
b) df.write.format(HIVE_WAREHOUSE_CONNECTOR).save()
1. Driver launches DataWriter tasks
2. Tasks write ORC files
3. On commit, Driver executes LOAD DATA INTO TABLE
- 18.
Example: Stream to Hive
val df = spark.readStream.format("socket")
  ...
  .load()
df.writeStream.format(STREAM_TO_STREAM)
  .option("table", "my_acid_table")
  .start()
- 19.
[Diagram: Spark Driver/Executors, Hive MetaStore, and HiveServer+Tez; Executors write directly into the ACID tables]
c) df.writeStream.format(STREAM_TO_STREAM).start()
1. Driver launches DataWriter tasks
2. Tasks open transactions
3. Tasks write rows to the ACID tables within each transaction
- 20.
Other Recent WRITE Improvements
Implemented WriteSupport and StreamWriteSupport Spark plugins
Improved Hive LOAD DATA INTO TABLE
– e.g. Support for bucketing and dynamic partitioning
Improved Hive-Streaming
– e.g. Support for dynamic partitioning
- 21.
Compatibility Matrix
Connector Branch     | Spark | Hive  | HDP
master (Summer 2018) | 2.3.1 | 3.1.0 | 3.0.0 (GA)
branch-2.3           | 2.3.0 | 2.1.0 | 2.6.5 (TP)
branch-2.2           | 2.2.0 | 2.1.0 | 2.6.3~4 (TP)
branch-2.1           | 2.1.1 | 2.1.0 | 2.6.0~2 (TP)
branch-1.6           | 1.6.3 | 2.1.0 | 2.5.x (TP)
https://github.com/hortonworks-spark/spark-llap
- 22.
Acknowledgements:
Teddy Choi, Jason Dere, Gunther Hagleitner,
Dongjoon Hyun, Prasanth Jayachandran, Hyukjin Kwon,
Bikas Saha, Jerry Zhao
Speaker notes
- ACID table details
Spark doesn’t support ACID tables
- Isolate Catalogs
Interoperate with Connector
- Other interoperability
Access to Hive tables mediated by Ranger
- Changes needed to read Hive
JDBC like API
- Bridge Hive catalog operations
- Execute query in Hive
Spark doesn’t directly access ACID tables
- HWC returns DataFrames that can be transformed by DataFrame API
- Support for writing SparkSQL Streams to ACID tables