- 1. © Hortonworks Inc. 2011 – 2017. All Rights Reserved
HiveWarehouseConnector (Update)
Eric Wohlstadter – Hortonworks R&D
June 2018
- 2.
Overview
HDP3 version of Spark-Hive Connector
Features
– Spark access to ACID tables
– Other integrations
• e.g. Spark access to Ranger tables
API and Architecture
– Reads from Hive to Spark
– Writes from Spark to Hive
- 3.
Features: Spark access to ACID tables
Hive supports traditional ACID semantics
– ORC with delta files to support low-latency writes
– Compaction to prevent storage fragmentation
– Custom readers to reconcile deltas on read
ACID tables use extended Metastore format
Spark doesn’t read/write ACID tables
Spark doesn’t use ACID Metastore format
Goal: Support Spark reads/writes for ACID tables
- 4.
[Diagram: Spark Driver/Executors with a Spark Metastore; HiveServer+Tez and LLAP Daemons with a Hive Metastore; the paths from Spark to the ACID tables are crossed out]
Spark can’t read/write ACID tables
Spark doesn’t use ACID Metastore format
- 5.
[Diagram: HWC bridges the Spark Driver/Executors to HiveServer+Tez and the LLAP Daemons; Spark and Hive keep separate Metastores]
Isolate Spark and Hive Catalogs/Tables
Leverage connector for Spark <-> Hive
- 6.
Features: Spark access to Ranger tables
– Column-level access control
– Column masking
• “show only first four chars of string column”
– Row-level access control
• “show only rows WHERE …”
Support Spark reads/writes for Ranger tables
[Screenshot: Ranger UI]
- 7.
Overview
Latest version of Hive Connector library for Spark
Features
– Spark access to Ranger tables
– Spark access to ACID tables
API and Architecture
– Reads from Hive to Spark
– Writes from Spark to Hive
- 8.
“JDBC-like” READ API
a) hive = HiveWarehouseBuilder.session(spark).build()
• Create HiveWarehouseSession
b) hive.execute(sql: String): DataFrame
• SHOW, DESCRIBE, etc.
c) hive.executeQuery(sql: String): DataFrame
• SELECT, SELECT CTE, etc.
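Taken together, a typical read flow with the API above might look like the following sketch. The `HiveWarehouseBuilder` import path and the table/column names are assumptions; running this requires a cluster with the connector on the Spark classpath and the HiveServer2 JDBC URL configured, so it is illustrative rather than standalone-runnable.

```scala
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder

// a) Build a HiveWarehouseSession from an existing SparkSession
val hive = HiveWarehouseBuilder.session(spark).build()

// b) Catalog operations (SHOW, DESCRIBE, ...) run over JDBC;
//    the ResultSet comes back as a DataFrame
hive.execute("SHOW DATABASES").show()

// c) Queries (SELECT, CTEs, ...) are compiled by HiveServer and
//    executed on LLAP; results arrive as a DataFrame that can be
//    transformed further with the normal DataFrame API
val df = hive.executeQuery("SELECT * FROM t")
df.sort("A").show()
```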
- 9.
[Diagram: Spark Driver connects to HiveServer via HWC over Thrift JDBC; Spark and Hive Metastores stay separate; Executors and LLAP Daemons are not involved]
b) hive.execute(“show databases”).show()
• Driver submits catalog op to HiveServer
• HWC returns the ResultSet as a DataFrame
- 10.
[Diagram: Spark Driver connects to HiveServer via HWC over JDBC; LLAP Daemons read the ACID tables]
c) hive.executeQuery(“SELECT * FROM t”).sort(“A”).show()
1. Driver submits query to HiveServer
2. HiveServer compiles the query and returns “splits” to the Driver
3. Query executes on LLAP
- 11.
[Diagram: Spark Executors fetch Arrow data via HWC from the LLAP Daemons, which read the ACID tables]
c) hive.executeQuery(“SELECT * FROM t”).sort(“A”).show()
4. Executor tasks run for each split
5. Tasks read Arrow data from LLAP
6. HWC returns ArrowColumnVectors to Spark
- 12.
Other Recent READ improvements
Leverage Spark 2.3.1 support for Arrow
Implemented SupportsColumnBatchScan plugin
Add Hive Arrow SerDe
Add Arrow support to LlapOutputFormatService
- 13.
Overview
Latest version of Hive Connector library for Spark
Features
– Spark access to Ranger tables
– Spark access to ACID tables
API and Architecture
– Reads from Hive to Spark
– Writes from Spark to Hive
- 14.
Connector WRITE API
a) hive.executeUpdate(sql: String): Boolean
• CREATE, UPDATE, ALTER, INSERT, MERGE, DELETE, etc.
b) df.write.format(HIVE_WAREHOUSE_CONNECTOR)
• Write DataFrame using LOAD DATA INTO TABLE
c) df.writeStream.format(STREAM_TO_STREAM)
• Write Streaming DataFrame using Hive-Streaming
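As a sketch, the three write paths above could be exercised as follows. This assumes an existing `hive` session plus DataFrames `df` and `streamingDf`, and that the `HIVE_WAREHOUSE_CONNECTOR` and `STREAM_TO_STREAM` format constants are imported from the connector; like the read examples, it needs a configured cluster to actually run.

```scala
// a) DDL/DML statements go through HiveServer; true is returned on success
val ok: Boolean = hive.executeUpdate("INSERT INTO s SELECT * FROM t")

// b) Batch write: executor tasks stage ORC files, then the driver
//    issues LOAD DATA INTO TABLE on commit
df.write.format(HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "my_acid_table")
  .save()

// c) Streaming write: tasks open Hive-Streaming transactions and
//    write rows directly into the ACID table
streamingDf.writeStream.format(STREAM_TO_STREAM)
  .option("table", "my_acid_table")
  .start()
```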
- 15.
[Diagram: Spark Driver submits to HiveServer2 via HWC over Thrift JDBC; Tez and LLAP process the update]
a) hive.executeUpdate(“INSERT INTO s SELECT * FROM t”)
1. Driver submits update op to HiveServer2
2. Update is processed through Tez and/or LLAP
3. HWC returns true on success
- 16.
Example: LOAD to Hive
df.select("ws_sold_time_sk", "ws_ship_date_sk")
  .filter("ws_sold_time_sk > 80000")
  .write.format(HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "my_acid_table")
  .save()
- 17.
[Diagram: Spark Executors write ORC files to HDFS /tmp; the Driver then issues LOAD DATA INTO TABLE to move them into the ACID tables]
b) df.write.format(HIVE_WAREHOUSE_CONNECTOR).save()
1. Driver launches DataWriter tasks
2. Tasks write ORC files
3. On commit, Driver executes LOAD DATA INTO TABLE
- 18.
Example: Stream to Hive
val df = spark.readStream.format("socket")
  ...
  .load()
df.writeStream.format(STREAM_TO_STREAM)
  .option("table", "my_acid_table")
  .start()
- 19.
[Diagram: Spark Driver/Executors, Hive MetaStore, and HiveServer+Tez; Executors write directly into the ACID tables]
c) df.writeStream.format(STREAM_TO_STREAM).start()
1. Driver launches DataWriter tasks
2. Tasks open transactions
3. Tasks write rows to the ACID tables within each transaction
- 20.
Other Recent WRITE Improvements
Implemented WriteSupport and StreamWriteSupport Spark plugins
Improved Hive LOAD DATA INTO TABLE
– e.g. Support for bucketing and dynamic partitioning
Improved Hive-Streaming
– e.g. Support for dynamic partitioning
- 21.
Compatibility Matrix
Connector Branch     | Spark | Hive  | HDP
master (Summer 2018) | 2.3.1 | 3.1.0 | 3.0.0 (GA)
branch-2.3           | 2.3.0 | 2.1.0 | 2.6.5 (TP)
branch-2.2           | 2.2.0 | 2.1.0 | 2.6.3~4 (TP)
branch-2.1           | 2.1.1 | 2.1.0 | 2.6.0~2 (TP)
branch-1.6           | 1.6.3 | 2.1.0 | 2.5.x (TP)
https://github.com/hortonworks-spark/spark-llap
- 22.
Acknowledgements:
Teddy Choi, Jason Dere, Gunther Hagleitner,
Dongjoon Hyun, Prasanth Jayachandran, Hyukjin Kwon,
Bikas Saha, Jerry Zhao
Speaker notes
- ACID table details
Spark doesn’t support ACID tables
- Isolate Catalogs
Interoperate with Connector
- Other interoperability
Access to Hive tables mediated by Ranger
- Changes needed to read Hive
JDBC like API
- Bridge Hive catalog operations
- Execute query in Hive
Spark doesn’t directly access ACID tables
- HWC returns DataFrames that can be transformed by DataFrame API
- Support for writing SparkSQL Streams to ACID tables