1 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
HiveWarehouseConnector (Update)
Eric Wohlstadter – Hortonworks R&D
June 2018
Overview
▪ HDP3 version of Spark-Hive Connector
▪ Features
  – Spark access to ACID tables
  – Other integrations
    • e.g. Spark access to Ranger tables
▪ API and Architecture
  – Reads from Hive to Spark
  – Writes from Spark to Hive
Features: Spark access to ACID tables
▪ Hive supports traditional ACID semantics
  – ORC with delta files to support low-latency writes
  – Compaction to prevent storage fragmentation
  – Custom readers to reconcile deltas on read
▪ ACID tables use extended Metastore format
▪ Spark doesn’t read/write ACID tables
▪ Spark doesn’t use ACID Metastore format
Support Spark reads/writes for ACID tables
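The delta-file idea above can be illustrated with a small, self-contained sketch. It is a toy in-memory model, not Hive's ORC ACID reader: the names `Event`, `reconcile` and `compact` are hypothetical, and a delete is assumed to be recorded as a row version with no value.

```scala
// Toy model of merge-on-read for an ACID table.
// All names are hypothetical; this is not Hive's actual reader.
object AcidReadSketch {
  // One row version: `writeId` orders the writes, `value = None` marks a delete.
  final case class Event(rowId: Long, writeId: Long, value: Option[String])

  // Reconcile base and delta events: keep the newest version of every row,
  // then drop rows whose newest version is a delete.
  def reconcile(base: Seq[Event], deltas: Seq[Event]): Map[Long, String] = {
    val newest: Seq[(Long, Option[String])] =
      (base ++ deltas).groupBy(_.rowId).toSeq.map {
        case (id, versions) => id -> versions.maxBy(_.writeId).value
      }
    newest.collect { case (id, Some(v)) => id -> v }.toMap
  }

  // Compaction folds the deltas into a new base so later reads touch fewer files.
  def compact(base: Seq[Event], deltas: Seq[Event]): Seq[Event] = {
    val all = base ++ deltas
    if (all.isEmpty) Seq.empty
    else {
      val latestWrite = all.map(_.writeId).max
      reconcile(base, deltas).toSeq.sortBy(_._1).map {
        case (id, v) => Event(id, latestWrite, Some(v))
      }
    }
  }
}
```

Reading the compacted base with no deltas gives the same rows as reconciling the original base with its deltas, which is the property a reader relies on.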
[Diagram: Spark (Driver, Executors, Spark Meta) and Hive (MetaStore, HiveServer+Tez, LLAP Daemons, Hive Meta) with ACID tables; two connections are crossed out (X)]
Spark can’t read/write ACID tables
Spark doesn’t use ACID Metastore format
[Diagram: Spark (Driver, Executors, Spark Meta) and Hive (MetaStore, HiveServer+Tez, LLAP Daemons, Hive Meta), linked by two connections labeled HWC]
Isolate Spark and Hive Catalogs/Tables
Leverage connector for Spark <-> Hive
Features: Spark access to Ranger tables
– Column-level access control
– Column masking
• “show only first four chars of string column”
– Row-level access control
• “show only rows WHERE …”
Support Spark reads/writes for Ranger tables
[Screenshot: Ranger UI]
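To make the three policy types concrete, here is a minimal sketch of how they combine on a result set. It is a toy model with hypothetical names (`Policy`, `maskFirstFour`, `applyPolicy`), not the Ranger API; in particular, replacing the hidden characters with `x` is an assumption about what the mask looks like.

```scala
// Toy model of Ranger-style policies applied to query results.
// All names are hypothetical; this is not the Ranger API.
object RangerPolicySketch {
  type Row = Map[String, String]

  final case class Policy(
    allowedColumns: Set[String], // column-level access control
    maskedColumns: Set[String],  // columns that are only partially shown
    rowFilter: Row => Boolean    // row-level access control ("show only rows WHERE ...")
  )

  // "Show only first four chars of string column": the rest becomes 'x' (assumed mask style).
  def maskFirstFour(s: String): String =
    s.take(4) + ("x" * math.max(0, s.length - 4))

  // Filter rows first, then drop disallowed columns and mask the remaining ones.
  def applyPolicy(rows: Seq[Row], p: Policy): Seq[Row] =
    rows.filter(p.rowFilter).map { row =>
      val visible: Seq[(String, String)] = row.toSeq.collect {
        case (col, v) if p.allowedColumns(col) =>
          col -> (if (p.maskedColumns(col)) maskFirstFour(v) else v)
      }
      visible.toMap
    }
}
```

In the sketch the policy is applied where the data is produced, so the caller never sees unmasked values or filtered-out rows.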
Overview
▪ Latest version of Hive Connector library for Spark
▪ Features
  – Spark access to Ranger tables
  – Spark access to ACID tables
▪ API and Architecture
  – Reads from Hive to Spark
  – Writes from Spark to Hive
“JDBC-like” READ API
a) hive = HiveWarehouseBuilder.session(spark).build()
• Create HiveWarehouseSession
b) hive.execute(sql: String): DataFrame
• SHOW, DESCRIBE, etc.
c) hive.executeQuery(sql: String): DataFrame
• SELECT, SELECT CTE, etc.
[Diagram: Spark (Driver, Executors, Spark Meta) and Hive (MetaStore, HiveServer+Tez, LLAP Daemons, Hive Meta); the Driver reaches HiveServer through HWC (Thrift JDBC)]
• Driver submits catalog op to HiveServer
• HWC returns ResultSet as DataFrame
b) hive.execute("show databases").show()
[Diagram: Spark (Driver, Executors, Spark Meta) and Hive (MetaStore, HiveServer+Tez, LLAP Daemons, Hive Meta) with ACID tables, connected through HWC (JDBC); steps 1 to 3 marked]
1. Driver submits query to HiveServer
2. Compile query and return “splits” to Driver
3. Execute query on LLAP
c) hive.executeQuery("SELECT * FROM t").sort("A").show()
[Diagram: Spark (Driver, Executors, Spark Meta) and Hive (MetaStore, HiveServer+Tez, LLAP Daemons, Hive Meta) with ACID tables, connected through HWC (Arrow); steps 4 to 6 marked]
4. Executor Tasks run for each split
5. Tasks read Arrow data from LLAP
6. HWC returns ArrowColumnVectors to Spark
c) hive.executeQuery("SELECT * FROM t").sort("A").show()
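The six read steps can be sketched as a plain Scala simulation. This is only a model of the control flow under simplifying assumptions: `Split`, `planSplits`, `readSplit` and `query` are hypothetical names, rows are plain integers rather than Arrow batches, and there is no real HiveServer or LLAP involved.

```scala
// Toy simulation of the split-based read path; all names are hypothetical.
object SplitReadSketch {
  // Stand-in for one planned slice of the query result.
  final case class Split(id: Int, rows: Vector[Int])

  // Steps 1-3: the server side compiles the query and hands back splits.
  def planSplits(table: Vector[Int], numSplits: Int): Vector[Split] = {
    require(numSplits > 0, "numSplits must be positive")
    val size = math.max(1, math.ceil(table.size.toDouble / numSplits).toInt)
    table.grouped(size).zipWithIndex.map { case (rows, i) => Split(i, rows) }.toVector
  }

  // Steps 4-5: one task per split reads that split's rows.
  def readSplit(split: Split): Vector[Int] = split.rows

  // Step 6 and after: the rows return to the engine, which keeps transforming
  // them; the sort here mirrors .sort("A") in the slide example.
  def query(table: Vector[Int], numSplits: Int): Vector[Int] =
    planSplits(table, numSplits).flatMap(readSplit).sorted
}
```

The point of the design is that only planning goes through the driver; the data itself moves between the tasks and the daemons that hold it.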
Other Recent READ improvements
▪ Leverage Spark 2.3.1 support for Arrow
▪ Implemented SupportsScanColumnarBatch plugin
▪ Add Hive Arrow SerDe
▪ Add Arrow support to LlapOutputFormatService
Overview
▪ Latest version of Hive Connector library for Spark
▪ Features
  – Spark access to Ranger tables
  – Spark access to ACID tables
▪ API and Architecture
  – Reads from Hive to Spark
  – Writes from Spark to Hive
Connector WRITE API
a) hive.executeUpdate(sql: String): Boolean
• Create, Update, Alter, Insert, Merge, Delete, etc.
b) df.write.format(HIVE_WAREHOUSE_CONNECTOR)
• Write DataFrame using LOAD DATA INTO TABLE
c) df.writeStream.format(STREAM_TO_STREAM)
• Write Streaming DataFrame using Hive-Streaming
[Diagram: Spark (Driver, Executors, Spark Meta) and Hive (MetaStore, HiveServer2+Tez, LLAP Daemons, Hive Meta); the Driver reaches HiveServer2 through HWC (Thrift JDBC); steps 1 to 3 marked]
a) hive.executeUpdate("INSERT INTO s SELECT * FROM t")
1. Driver submits update op to HiveServer2
2. Process update through Tez and/or LLAP
3. HWC returns true on success
Example: LOAD to Hive
df.select("ws_sold_time_sk", "ws_ship_date_sk")
  .filter("ws_sold_time_sk > 80000")
  .write.format(HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "my_acid_table")
  .save()
[Diagram: Spark (Driver, Executors, Spark Meta) and Hive (MetaStore, HiveServer+Tez, LLAP Daemons, Hive Meta) with HDFS /tmp and ACID tables; steps 1 to 3 marked]
b) df.write.format(HIVE_WAREHOUSE_CONNECTOR).save()
1. Driver launches DataWriter tasks
2. Tasks write ORC files
3. On commit, Driver executes LOAD DATA INTO TABLE
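The batch write path follows a stage-then-commit pattern, which can be sketched with local files. This is a simplified model under stated assumptions: the names `writeTask`, `commit` and `save` are hypothetical, plain text files stand in for ORC files, a local directory stands in for HDFS /tmp, and a file move stands in for LOAD DATA INTO TABLE.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// Toy model of "write to staging, load on commit"; all names are hypothetical.
object StagedWriteSketch {
  // Step 2: each task writes its partition to its own file in the staging directory.
  def writeTask(staging: Path, taskId: Int, rows: Seq[String]): Path =
    Files.write(staging.resolve(s"part-$taskId"), rows.mkString("\n").getBytes("UTF-8"))

  // Step 3: on commit, the driver moves every staged file into the table directory.
  def commit(staged: Seq[Path], tableDir: Path): Seq[Path] =
    staged.map { f =>
      Files.move(f, tableDir.resolve(f.getFileName), StandardCopyOption.REPLACE_EXISTING)
    }

  // Step 1: the driver launches one write task per partition, then commits.
  def save(partitions: Seq[Seq[String]], staging: Path, tableDir: Path): Seq[Path] = {
    val staged = partitions.zipWithIndex.map { case (rows, i) => writeTask(staging, i, rows) }
    commit(staged, tableDir)
  }
}
```

Because nothing reaches the table directory before the commit step, a failed task leaves only staging files behind.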
Example: Stream to Hive
val df = spark.readStream.format("socket")
  ...
  .load()
df.writeStream.format(STREAM_TO_STREAM)
  .option("table", "my_acid_table")
  .start()
[Diagram: Spark (Driver, Executors, Spark Meta) and Hive (MetaStore, HiveServer+Tez, Hive Meta) with ACID tables; steps 1 to 3 marked]
c) df.writeStream.format(STREAM_TO_STREAM).start()
1. Driver launches DataWriter tasks
2. Tasks open Txns
3. Write rows to ACID tables in Txn
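The streaming path differs from the batch path in that each task writes inside a transaction. A minimal in-memory sketch of that idea follows; `Table`, `Txn` and `writeBatch` are hypothetical names and this is not the Hive Streaming API. The only property it models is that rows become visible to readers on commit.

```scala
// Toy model of transactional streaming writes; all names are hypothetical.
object StreamingTxnSketch {
  // In-memory stand-in for a transactional table: rows are visible only after commit.
  final class Table {
    private var committed = Vector.empty[String]
    private var nextTxnId = 0L

    def openTxn(): Txn = synchronized {
      nextTxnId += 1
      new Txn(nextTxnId, this)
    }

    private[StreamingTxnSketch] def publish(rows: Vector[String]): Unit =
      synchronized { committed = committed ++ rows }

    def read(): Vector[String] = synchronized { committed }
  }

  final class Txn(val id: Long, table: Table) {
    private var buffer = Vector.empty[String]
    def write(row: String): Unit = { buffer = buffer :+ row }
    def commit(): Unit = { table.publish(buffer); buffer = Vector.empty }
    def abort(): Unit = { buffer = Vector.empty }
  }

  // One batch handled by one writer task: open a transaction, write rows, commit.
  def writeBatch(table: Table, rows: Seq[String]): Long = {
    val txn = table.openTxn()
    rows.foreach(txn.write)
    txn.commit()
    txn.id
  }
}
```

An aborted transaction leaves the table unchanged, so a retried batch does not produce partial rows.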
Other Recent Write Improvements
▪ Implemented WriteSupport and StreamWriteSupport Spark plugins
▪ Improved Hive LOAD DATA INTO TABLE
  – e.g. Support for bucketing and dynamic partitioning
▪ Improved HiveStreaming
  – e.g. Support for dynamic partitioning
Compatibility Matrix
Connector Branch        Spark   Hive    HDP
master (Summer 2018)    2.3.1   3.1.0   3.0.0 (GA)
branch-2.3              2.3.0   2.1.0   2.6.5 (TP)
branch-2.2              2.2.0   2.1.0   2.6.3~4 (TP)
branch-2.1              2.1.1   2.1.0   2.6.0~2 (TP)
branch-1.6              1.6.3   2.1.0   2.5.x (TP)
https://github.com/hortonworks-spark/spark-llap
Acknowledgements:
Teddy Choi, Jason Dere, Gunther Hagleitner,
Dongjoon Hyun, Prasanth Jayachandran, Hyukjin Kwon,
Bikas Saha, Jerry Zhao

Editor's notes

  1. ACID table details. Spark doesn’t support ACID tables.
  2. Isolate catalogs. Interoperate through the connector.
  3. Other interoperability: access to Hive tables mediated by Ranger.
  4. Changes needed to read from Hive: a JDBC-like API.
  5. Bridge Hive catalog operations.
  6. Execute the query in Hive. Spark doesn’t directly access ACID tables.
  7. HWC returns DataFrames that can be transformed by the DataFrame API.
  8. Support for writing SparkSQL Streams to ACID tables.