© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Row/Column-level Security in SQL for Apache Spark
Dongjoon Hyun – Software Engineer
Bikas Saha – Software Engineer
April 2017
Who am I
Software Engineer @ Hortonworks
Apache REEF PMC member and committer
Apache Spark project contributor
https://github.com/dongjoon-hyun
Dongjoon Hyun
Agenda
Security Issues
Goals
Components
How it works
Demo
Security
One of the fundamental features for enterprise adoption
– Multi-tenancy: Billing team / Data science team / Marketing team
Row and column-level access control for SQL users
– Row filtering
– Column masking
Must enforce shared policies across various SQL engines simultaneously
– E.g. Apache Spark 2.1/1.6 and Apache Hive 2.1
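As a rough illustration of the two controls named above, the following minimal Python sketch applies a per-user row filter and column mask to in-memory rows. The policy layout, the `apply_policy` helper, the sample rows, and the masking rule are all hypothetical; this is not Ranger's or Hive's actual data model.

```python
# Toy model of row filtering and column masking (not a Ranger/Hive API).
ROWS = [
    {"name": "Alice", "gender": "F", "region": "US", "ssn": "111-22-3333"},
    {"name": "Bob", "gender": "M", "region": "EU", "ssn": "444-55-6666"},
]

# Hypothetical per-user policies: a row predicate plus per-column mask functions.
POLICIES = {
    "billing": {"row_filter": lambda row: True, "masks": {}},
    "marketing": {
        "row_filter": lambda row: row["region"] == "US",
        "masks": {"ssn": lambda value: "xxx-xx-" + value[-4:]},
    },
}


def apply_policy(user, rows):
    """Return only the rows the user may see, with masked columns rewritten."""
    policy = POLICIES[user]
    visible = []
    for row in rows:
        if not policy["row_filter"](row):
            continue  # row filtering: drop rows the user must not see
        masked = dict(row)  # copy so the stored data is never modified
        for column, mask in policy["masks"].items():
            masked[column] = mask(masked[column])  # column masking
        visible.append(masked)
    return visible


if __name__ == "__main__":
    for user in ("billing", "marketing"):
        print(user, apply_policy(user, ROWS))
```

The same query therefore yields different results per user, which is the behavior the shared policies are meant to guarantee across engines.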
Issue 1
Spark reads all or nothing
– Directory/File-based permissions are insufficient
Permission 777 on warehouse?
Security starts from storage
Issue 2
Spark apps must be rewritten
– Special data source tables
Duplicated data
– Filtered rows
– Removed or masked columns
SQL views
– Maintained manually
Overhead in setting up and maintaining security policies
Goal 1: Spark SQL Apps
Support row/column-level security in batch apps
from pyspark.sql import SparkSession
# Parentheses let the builder chain span several lines
spark = (SparkSession
    .builder
    .enableHiveSupport()
    .getOrCreate())
spark.sql("select * from db_common.t_customer").show()
db_common
t_customer
…
Goal 2: Spark shells (1/2)
Support row/column-level security in all shells
spark-shell
pyspark
Goal 2: Spark shells (2/2)
Support row/column-level security in all shells
sparkR
spark-sql
Goal 3: Spark Thrift Server
Support row/column-level security with Spark Thrift Server
Login as `hive`
Login as `spark`
What is required?
Kerberos
Apache Hadoop (HDFS/YARN)
Apache Ranger
Apache Hive (LLAP)
Spark-LLAP: A library and patches to integrate the above
Focus here
Apache Ranger
Provide a standard authorization method across many Hadoop components
https://hortonworks.com/apache/ranger/#section_2
Apache Hive
Hive Ranger Plugin & Policies
– Support row/column-level security
LLAP Daemon (GA in HDP 2.6)
– Persistent query servers with intelligent in-memory caching
– Provide a secure relational datanode view of the data
Trusted Service
Milestone
Spark-LLAP (Technical Preview)
HDP 2.5
Spark-LLAP for Spark 1.6
• User should use LlapContext
• Support Scala/Java and spark-shell
var lc = new LlapContext(sc)
lc.sql("select * from t").show
HDP 2.6
Spark-LLAP for Spark 2.1
• No need to rewrite SQL related code
• Support all languages and shells
Next
Spark-LLAP for Spark 2.1
• Support YARN cluster mode
Spark-LLAP GitHub (Apache License)
How it works – Overview
Case: spark-submit with YARN cluster mode
Components: Spark, Hive (HiveServer2), Ranger, LLAP
Actors: User, Admin
1. Manage policies
2. Launch
3. Get delegation token
4. Get metadata
5. Get data locations
6. Read filtered/masked data
7. Monitor audits
Unnumbered arrow: Authorize
How it works – Overview
(Same diagram and steps as the previous slide, with a legend added)
Legend: Existing Infra / New for Spark / New for Hive (GA)
Admin – Manage
Hive Database: db_common
Table: *
Hive Column: *
Select User: spark
Permissions: SELECT
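To make the wildcard semantics of such a policy concrete, here is a small Python sketch that checks a request against the fields shown on this slide. The dictionary layout and the `is_allowed` helper are hypothetical and only mirror the form fields; they are not Ranger's policy schema or API.

```python
from fnmatch import fnmatchcase

# Hypothetical representation of the policy on this slide (not Ranger's schema).
POLICY = {
    "database": "db_common",
    "table": "*",
    "column": "*",
    "users": {"spark"},
    "permissions": {"select"},
}


def is_allowed(policy, user, database, table, column, action):
    """Check whether the policy grants `action` on the resource to `user`."""
    return (
        user in policy["users"]
        and action in policy["permissions"]
        and fnmatchcase(database, policy["database"])  # "*" matches any name
        and fnmatchcase(table, policy["table"])
        and fnmatchcase(column, policy["column"])
    )


if __name__ == "__main__":
    print(is_allowed(POLICY, "spark", "db_common", "t_customer", "name", "select"))
    print(is_allowed(POLICY, "spark", "db_other", "t_customer", "name", "select"))
```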
User
Launch Spark jobs
spark-submit \
  --jars spark-llap.jar \
  --conf spark.sql.hive.llap=true \
  --conf spark.yarn.security.credentials.hiveserver2.enabled=true \
  --master yarn \
  --deploy-mode cluster \
  sql.py
– spark.sql.hive.llap=true: easy to turn on/off
– spark.yarn.security.credentials.hiveserver2.enabled=true: only used for YARN cluster mode
Note: There are more static configurations related to LLAP
The `--packages` option is supported, too
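The launch command above can also be assembled programmatically, which makes the on/off switch explicit. The sketch below is only an illustration: `build_spark_submit_args` and its parameters are hypothetical names, the jar and script names are the ones from the slide, and omitting the HiveServer2 credential setting outside cluster mode follows the slide's annotation rather than any tested behavior.

```python
import shlex


def build_spark_submit_args(script, enable_llap=True, cluster_mode=True):
    """Assemble a spark-submit argument list from the options on this slide."""
    args = ["spark-submit", "--jars", "spark-llap.jar"]
    # Row/column-level security is toggled by a single configuration value.
    args += ["--conf", "spark.sql.hive.llap=" + str(enable_llap).lower()]
    if enable_llap and cluster_mode:
        # Per the slide, this setting is only used for YARN cluster mode.
        args += ["--conf", "spark.yarn.security.credentials.hiveserver2.enabled=true"]
    args += ["--master", "yarn"]
    if cluster_mode:
        args += ["--deploy-mode", "cluster"]
    args.append(script)
    return args


if __name__ == "__main__":
    # Print a shell-safe command line instead of executing it.
    print(" ".join(shlex.quote(arg) for arg in build_spark_submit_args("sql.py")))
```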
Spark
Get delegation tokens
HDFS Delegation Token (existing)
– HDFSCredentialProvider gets it from namenode
Hive Metastore Delegation Token (existing)
– HiveCredentialProvider gets it from Hive Metastore
HiveServer2 Delegation Token (Spark-LLAP)
– HiveServer2CredentialProvider gets it from HiveServer2
Note: Spark manages token renewal
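One way to picture how the three providers are selected is a registry keyed by service name, where each provider is consulted only if its `enabled` setting is true. This Python sketch is a toy model under stated assumptions: the configuration key pattern follows the `hiveserver2` setting shown on the launch slide, while the service names `hdfs` and `hive`, the default values, and the `enabled_providers` helper are assumptions made for illustration, not Spark's implementation.

```python
# Toy registry of credential providers keyed by a service name.
# Service names and default values are assumptions for illustration only.
PROVIDERS = {
    "hdfs": "HDFSCredentialProvider",
    "hive": "HiveCredentialProvider",
    "hiveserver2": "HiveServer2CredentialProvider",
}
DEFAULT_ENABLED = {"hdfs": True, "hive": True, "hiveserver2": False}


def enabled_providers(conf):
    """Return provider names whose per-service 'enabled' key resolves to true."""
    selected = []
    for service, provider in PROVIDERS.items():
        key = "spark.yarn.security.credentials.%s.enabled" % service
        default = "true" if DEFAULT_ENABLED[service] else "false"
        if conf.get(key, default).lower() == "true":
            selected.append(provider)
    return selected


if __name__ == "__main__":
    conf = {"spark.yarn.security.credentials.hiveserver2.enabled": "true"}
    print(enabled_providers(conf))
```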
Spark
LlapMetastoreCatalog: Replaces MetastoreRelation with LlapRelation
SELECT gender, count(*)
FROM db_common.t_customer
WHERE name LIKE '%Obama'
GROUP BY gender
Parsed Logical Plan
Aggregate: gender
  Filter: name like %Obama
    UnresolvedRelation
Analyzed Logical Plan
Aggregate: gender
  Filter: name like %Obama
    SubqueryAlias
      LlapRelation
Spark
LlapMetastoreCatalog: Replaces MetastoreRelation with LlapRelation
Without Spark-LLAP
With Spark-LLAP
Spark
LlapRelation supports predicate pushdown during optimization
Analyzed Logical Plan
Aggregate: gender
  Filter: name like %Obama
    SubqueryAlias
      LlapRelation
Optimized Logical Plan
Aggregate: gender
  Project: gender
    Filter: EndsWith(name,Obama)
      LlapRelation
Spark
LlapRelation supports predicate pushdown during optimization
Physical Plan
…
HashAggregate: gender
  Project: gender
    Filter: EndsWith(name, Obama)
      Scan LlapRelation
      PushedFilter: StringEndsWith(name, Obama)
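The plans show `name like %Obama` turning into `EndsWith(name, Obama)` and then into a pushed `StringEndsWith` filter. A hedged sketch of that kind of rewrite, in plain Python rather than Spark's optimizer code, is given below. The helper names `simplify_like` and `matches` are hypothetical, and only the simplest patterns are handled (no `_` wildcard, no escapes, no `%` in the middle).

```python
def simplify_like(pattern):
    """Map a simple LIKE pattern to (operator, literal), or None if not simple."""
    # Patterns with '_' or an escape character are left untouched in this sketch.
    if "_" in pattern or "\\" in pattern:
        return None
    body = pattern.strip("%")
    if not body or "%" in body:
        return None  # only wildcards, or a wildcard in the middle
    leading = pattern.startswith("%")
    trailing = pattern.endswith("%")
    if leading and trailing:
        return ("Contains", body)
    if leading:
        return ("EndsWith", body)
    if trailing:
        return ("StartsWith", body)
    return ("EqualTo", body)


def matches(simplified, value):
    """Evaluate a simplified predicate against one string value."""
    operator, literal = simplified
    if operator == "EndsWith":
        return value.endswith(literal)
    if operator == "StartsWith":
        return value.startswith(literal)
    if operator == "Contains":
        return literal in value
    return value == literal  # EqualTo


if __name__ == "__main__":
    predicate = simplify_like("%Obama")
    names = ["Barack Obama", "Michelle Obama", "Obama Fan"]
    print(predicate, [name for name in names if matches(predicate, name)])
```

A simple string predicate like this is cheap to hand to the data source, so the source can skip non-matching rows before they reach Spark.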
Spark
Read filtered and masked data from LLAP
jobConf.set("hive.llap.zk.registry.user", "hive")
jobConf.set("llap.if.hs2.connection", parameters("url"))
jobConf.set("llap.if.query", queryString)
…
// Create Hadoop RDD and convert LLAP Row into Spark Row
sc.sparkContext
.hadoopRDD(…)
.mapPartitionsWithInputSplit(…)
Some related SPARK Issues
SPARK-14743 Add a configurable credential manager for Spark running on YARN
SPARK-15777 Catalog federation (Open)
SPARK-17767 Spark SQL ExternalCatalog API custom implementation support (Closed as Later)
SPARK-17819 Support default database in connection URIs for Spark Thrift Server
SPARK-18517 DROP TABLE IF EXISTS should not warn for non-exist
SPARK-18840 Avoid throw exception when getting token renewal interval in non HDFS security env.
SPARK-18857 Don't use `Iterator.duplicate` in STS
SPARK-19021 Generalize HDFSCredentialProvider to support non HDFS security filesystems
SPARK-19038 Avoid overwriting keytab configuration in yarn-client
SPARK-19179 Change spark.yarn.access.namenodes config and update docs
SPARK-19970 Table owner should be USER instead of PRINCIPAL
SPARK-19995 Register tokens to current UGI to avoid re-issuing of tokens in yarn client mode
Summary
Support row/column-level security with
– Spark apps
– Spark shells
– Spark Thrift Server
You can use the existing Spark 2.X SQL apps and scripts
Easy to turn on/off with only configurations
Ranger enforces Hive/Spark simultaneously and consistently
Spark-LLAP in HDP 2.6 is a Technical Preview (TP)
Acknowledgement
Apache Hive / Apache Spark / Apache Ranger
Bikas Saha, Saisai Shao, Jason Dere, Thejas Nair, Zhan Zhang, and
many others