1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Row/Column-level Security in SQL for Apache Spark
Dongjoon Hyun – Software Engineer
Bikas Saha – Software Engineer
April 2017
Who am I
Dongjoon Hyun
• Software Engineer @ Hortonworks
• Apache REEF PMC member and committer
• Apache Spark project contributor
• https://github.com/dongjoon-hyun
Agenda
Security Issues
Goals
Components
How it works
Demo
Security
• One of the fundamental features for enterprise adoption
– Multi-tenancy: Billing team / Data science team / Marketing teams
• Row and column-level access control for SQL users
– Row filtering
– Column masking
• Must enforce shared policies across various SQL engines simultaneously
– E.g. Apache Spark 2.1/1.6 and Apache Hive 2.1
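To make the two access-control terms above concrete, here is a minimal pure-Python sketch of what row filtering and column masking do to a result set. It only illustrates the semantics and is not how Ranger, Hive, or Spark implement them; the sample rows, the `region` and `ssn` columns, and the `apply_policy` helper are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def apply_policy(
    rows: List[Row],
    row_filter: Callable[[Row], bool],
    masks: Dict[str, Callable[[object], object]],
) -> List[Row]:
    """Drop rows the user may not see, then mask the protected columns."""
    visible = [row for row in rows if row_filter(row)]
    return [
        {col: masks[col](val) if col in masks else val for col, val in row.items()}
        for row in visible
    ]


# Hypothetical sample data (not from the talk).
customers = [
    {"name": "Alice", "gender": "F", "region": "US", "ssn": "123-45-6789"},
    {"name": "Bob", "gender": "M", "region": "EU", "ssn": "987-65-4321"},
]


def us_only(row: Row) -> bool:
    # Row filter: this user may only see US customers.
    return row["region"] == "US"


def mask_ssn(value: object) -> str:
    # Column mask: keep only the last four characters.
    return "xxx-xx-" + str(value)[-4:]


result = apply_policy(customers, us_only, {"ssn": mask_ssn})
print(result)
```

The point of the approach in this talk is that such rules live in one shared policy store rather than in each application, so the query text itself stays unchanged.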
Issue 1
• Spark reads all or nothing
– Directory/File-based permissions are insufficient
• Permission 777 on warehouse?
Security starts from storage
Issue 2
• Spark apps must be rewritten
– Special data source tables
• Duplicated data
– Filtered rows
– Removed or masked columns
• SQL Views
– Maintained manually
Overhead when setting up and maintaining security policies
Goals
Goal 1: Spark SQL Apps
Support row/column-level security with the batch apps
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("select * from db_common.t_customer").show()
db_common
t_customer
…
Goal 2: Spark shells (1/2)
Support row/column-level security in all shells
spark-shell
pyspark
Goal 2: Spark shells (2/2)
Support row/column-level security in all shells
sparkR
spark-sql
Goal 3: Spark Thrift Server
Support row/column-level security with Spark Thrift Server
Login as `hive`
Login as `spark`
Components
What is required?
• Kerberos
• Apache Hadoop (HDFS/YARN)
• Apache Ranger
• Apache Hive (LLAP)
• Spark-LLAP: A library and patches to integrate the above
Focus here
Apache Ranger
Provides a standard authorization method across many Hadoop components
https://hortonworks.com/apache/ranger/#section_2
Apache Hive
• Hive Ranger Plugin & Policies
– Support row/column-level security
• LLAP Daemon (GA in HDP 2.6)
– Persistent query servers with intelligent in-memory caching
– Provide a secure relational datanode view of the data
Trusted Service
Spark-LLAP (Technical Preview)
Milestone
HDP 2.5
Spark-LLAP for Spark 1.6
• User should use LlapContext
• Support Scala/Java and spark-shell
var lc = new LlapContext(sc)
lc.sql("select * from t").show
HDP 2.6
Spark-LLAP for Spark 2.1
• No need to rewrite SQL related code
• Support all languages and shells
Next
Spark-LLAP for Spark 2.1
• Support YARN cluster mode
Spark-LLAP GitHub (Apache License)
How it works
How it works – Overview
Case: spark-submit with YARN cluster mode
[Diagram: User, Admin, Spark, Hive (HiveServer2), Ranger, LLAP]
1. Manage policies
2. Launch
3. Get delegation token
4. Get metadata
5. Get data locations
6. Read filtered/masked data
7. Monitor Audits
Authorize
How it works – Overview
[Diagram legend: Existing Infra, New for Spark, New for Hive (GA)]
Hive
Enable LLAP
Admin – Manage
Hive Database: db_common
Table: *
Hive Column: *
Select User: spark
Permissions: SELECT
Admin – Audit
User
Launch Spark jobs
spark-submit \
  --jars spark-llap.jar \
  --conf spark.sql.hive.llap=true \
  --conf spark.yarn.security.credentials.hiveserver2.enabled=true \
  --master yarn \
  --deploy-mode cluster \
  sql.py
Note: There are more static configurations related to LLAP
The `--packages` option is supported, too
Easy to turn on/off
Only used for YARN cluster mode
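The "easy to turn on/off" point can be sketched as a small helper that assembles the command line. This is only an illustration: `build_spark_submit` is a hypothetical function, the flag values are the ones shown on the slide, and tying the HiveServer2 credential setting to cluster mode is my reading of the "Only used for YARN cluster mode" callout rather than something the slide states explicitly.

```python
from typing import List


def build_spark_submit(app: str, use_llap: bool, cluster_mode: bool) -> List[str]:
    """Assemble a spark-submit argument list, with Spark-LLAP as an optional switch."""
    cmd = ["spark-submit"]
    if use_llap:
        # Options taken from the slide; the jar name is the one shown there.
        cmd += ["--jars", "spark-llap.jar", "--conf", "spark.sql.hive.llap=true"]
        if cluster_mode:
            # Assumption: this setting is the one needed only in YARN cluster mode.
            cmd += [
                "--conf",
                "spark.yarn.security.credentials.hiveserver2.enabled=true",
            ]
    cmd += ["--master", "yarn"]
    if cluster_mode:
        cmd += ["--deploy-mode", "cluster"]
    cmd.append(app)
    return cmd


print(" ".join(build_spark_submit("sql.py", use_llap=True, cluster_mode=True)))
```

The application script (`sql.py`) is the same in both cases; only the launch options change.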
Spark
Get delegation tokens
• HDFS Delegation Token (Existing)
– HDFSCredentialProvider gets it from namenode
• Hive Metastore Delegation Token (Existing)
– HiveCredentialProvider gets it from Hive Metastore
• HiveServer2 Delegation Token (Spark-LLAP)
– HiveServer2CredentialProvider gets it from HiveServer2
Note: Spark manages token renewal
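The provider pattern above can be modelled as a registry keyed by service name, where each provider is switched on through a `spark.yarn.security.credentials.<service>.enabled` setting (the pattern of the HiveServer2 option shown on the launch slide). The following is a toy model only, not Spark's credential manager: the `collect_tokens` helper, the service names, the placeholder token strings, and the "disabled unless set to true" default are all assumptions made for illustration.

```python
from typing import Callable, Dict


def collect_tokens(
    providers: Dict[str, Callable[[], str]], conf: Dict[str, str]
) -> Dict[str, str]:
    """Ask every enabled provider for a token and return them by service name."""
    tokens = {}
    for service, obtain in providers.items():
        key = f"spark.yarn.security.credentials.{service}.enabled"
        # Assumption for this sketch: a provider is off unless explicitly enabled.
        if conf.get(key, "false").lower() == "true":
            tokens[service] = obtain()
    return tokens


# Placeholder providers; real ones would contact the namenode, the Hive
# Metastore, and HiveServer2 respectively.
providers = {
    "hdfs": lambda: "hdfs-token",
    "hive": lambda: "metastore-token",
    "hiveserver2": lambda: "hs2-token",
}

conf = {
    "spark.yarn.security.credentials.hdfs.enabled": "true",
    "spark.yarn.security.credentials.hive.enabled": "true",
    "spark.yarn.security.credentials.hiveserver2.enabled": "true",
}

tokens = collect_tokens(providers, conf)
print(sorted(tokens))
```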
Spark
LlapMetastoreCatalog: Replaces MetastoreRelation with LlapRelation
SELECT gender, count(*)
FROM db_common.t_customer
WHERE name LIKE '%Obama'
GROUP BY gender
Parsed Logical Plan
Aggregate: gender
Filter: name like %Obama
UnresolvedRelation
Analyzed Logical Plan
Aggregate: gender
Filter: name like %Obama
SubqueryAlias
LlapRelation
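The relation swap can be pictured as a tree rewrite over the logical plan. The sketch below is a toy model in plain Python, not Spark's Catalyst API: the `Node` class and the `replace_relation` and `render` helpers are hypothetical, and only the operator names come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One operator in a linear (single-child) logical plan."""
    name: str
    detail: str = ""
    child: Optional["Node"] = None


def replace_relation(plan: Node, old: str, new: str) -> Node:
    """Return a copy of the plan in which every `old` operator is renamed to `new`."""
    child = replace_relation(plan.child, old, new) if plan.child else None
    name = new if plan.name == old else plan.name
    return Node(name, plan.detail, child)


def render(plan: Node) -> str:
    """Print the plan top-down, indenting each child one level."""
    lines, node, depth = [], plan, 0
    while node is not None:
        label = f"{node.name}: {node.detail}" if node.detail else node.name
        lines.append("  " * depth + label)
        node, depth = node.child, depth + 1
    return "\n".join(lines)


# Analyzed plan as it would look without Spark-LLAP.
analyzed = Node(
    "Aggregate", "gender",
    Node("Filter", "name like %Obama",
         Node("SubqueryAlias", "", Node("MetastoreRelation"))),
)

with_llap = replace_relation(analyzed, "MetastoreRelation", "LlapRelation")
print(render(with_llap))
```

Because the swap happens at the leaf of the plan, the operators above it (the filter and the aggregate) are untouched, which is why existing SQL does not need to change.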
Spark
LlapMetastoreCatalog: Replaces MetastoreRelation with LlapRelation
Without Spark-LLAP
With Spark-LLAP
Spark
LlapRelation supports predicate pushdown during optimization
Analyzed Logical Plan
Aggregate: gender
Filter: name like %Obama
SubqueryAlias
LlapRelation
Optimized Logical Plan
Aggregate: gender
Project: gender
Filter: EndsWith(name,Obama)
LlapRelation
Spark
LlapRelation supports predicate pushdown during optimization
Physical Plan
…
HashAggregate: gender
Project: gender
Filter: EndsWith(name, Obama)
Scan LlapRelation
PushedFilter: StringEndsWith(name, Obama)
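The step from `name like %Obama` to `EndsWith(name, Obama)` and then to a pushed `StringEndsWith` filter can be approximated by a small pattern classifier. This is a simplified model of the rewrite, not Spark's optimizer code: `simplify_like` is a hypothetical helper, it ignores escape characters, and it gives up on any pattern containing `_` or an inner `%`. The tuple tags reuse the filter names that Spark's data source API exposes (`StringStartsWith`, `StringEndsWith`, `StringContains`, `EqualTo`).

```python
from typing import Optional, Tuple


def simplify_like(column: str, pattern: str) -> Optional[Tuple[str, str, str]]:
    """Map a simple LIKE pattern to a pushable filter, or None if it is too complex."""
    if "_" in pattern or "\\" in pattern:
        return None  # single-character wildcards and escapes are not handled here
    body = pattern.strip("%")
    if not body or "%" in body:
        return None  # empty text or a wildcard in the middle of the pattern
    leading = pattern.startswith("%")
    trailing = pattern.endswith("%")
    if leading and trailing:
        return ("StringContains", column, body)
    if leading:
        return ("StringEndsWith", column, body)
    if trailing:
        return ("StringStartsWith", column, body)
    return ("EqualTo", column, body)


print(simplify_like("name", "%Obama"))
```

A filter in this simple form can be handed to the data source, so rows that do not match need not be shipped back to Spark.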
Spark
Read filtered and masked data from LLAP
jobConf.set("hive.llap.zk.registry.user", "hive")
jobConf.set("llap.if.hs2.connection", parameters("url"))
jobConf.set("llap.if.query", queryString)
…
// Create Hadoop RDD and convert LLAP Row into Spark Row
sc.sparkContext
.hadoopRDD(…)
.mapPartitionsWithInputSplit(…)
Demo (Video)
Some related SPARK Issues
• SPARK-14743 Add a configurable credential manager for Spark running on YARN
• SPARK-15777 Catalog federation (Open)
• SPARK-17767 Spark SQL ExternalCatalog API custom implementation support (Closed as Later)
• SPARK-17819 Support default database in connection URIs for Spark Thrift Server
• SPARK-18517 DROP TABLE IF EXISTS should not warn for non-exist
• SPARK-18840 Avoid throw exception when getting token renewal interval in non HDFS security env.
• SPARK-18857 Don't use `Iterator.duplicate` in STS
• SPARK-19021 Generalize HDFSCredentialProvider to support non HDFS security filesystems
• SPARK-19038 Avoid overwriting keytab configuration in yarn-client
• SPARK-19179 Change spark.yarn.access.namenodes config and update docs
• SPARK-19970 Table owner should be USER instead of PRINCIPAL
• SPARK-19995 Register tokens to current UGI to avoid re-issuing of tokens in yarn client mode
Summary
• Support row/column-level security with
– Spark apps
– Spark shells
– Spark Thrift Server
• You can use the existing Spark 2.X SQL apps and scripts
• Easy to turn on/off with only configurations
• Ranger enforces policies on Hive and Spark simultaneously and consistently
Spark-LLAP with HDP 2.6 is a Technical Preview (TP)
Acknowledgement
• Apache Hive / Apache Spark / Apache Ranger
• Bikas Saha, Saisai Shao, Jason Dere, Thejas Nair, Zhan Zhang, and many others
Thank you

Contenu connexe

Tendances

Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
Guido Schmutz
 
Christo kutrovsky oracle, memory & linux
Christo kutrovsky   oracle, memory & linuxChristo kutrovsky   oracle, memory & linux
Christo kutrovsky oracle, memory & linux
Kyle Hailey
 

Tendances (20)

Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
 
Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22
 
MySQL Shell - the best DBA tool !
MySQL Shell - the best DBA tool !MySQL Shell - the best DBA tool !
MySQL Shell - the best DBA tool !
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Oracle Database Vault
Oracle Database VaultOracle Database Vault
Oracle Database Vault
 
Oracle RAC 19c - the Basis for the Autonomous Database
Oracle RAC 19c - the Basis for the Autonomous DatabaseOracle RAC 19c - the Basis for the Autonomous Database
Oracle RAC 19c - the Basis for the Autonomous Database
 
Exadata master series_asm_2020
Exadata master series_asm_2020Exadata master series_asm_2020
Exadata master series_asm_2020
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
MySQL sys schema deep dive
MySQL sys schema deep diveMySQL sys schema deep dive
MySQL sys schema deep dive
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
 
Best Practices for the Most Impactful Oracle Database 18c and 19c Features
Best Practices for the Most Impactful Oracle Database 18c and 19c FeaturesBest Practices for the Most Impactful Oracle Database 18c and 19c Features
Best Practices for the Most Impactful Oracle Database 18c and 19c Features
 
Christo kutrovsky oracle, memory & linux
Christo kutrovsky   oracle, memory & linuxChristo kutrovsky   oracle, memory & linux
Christo kutrovsky oracle, memory & linux
 
On Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQLOn Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQL
 
Oracle Fleet Patching and Provisioning Deep Dive Webcast Slides
Oracle Fleet Patching and Provisioning Deep Dive Webcast SlidesOracle Fleet Patching and Provisioning Deep Dive Webcast Slides
Oracle Fleet Patching and Provisioning Deep Dive Webcast Slides
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
 
Oracle GoldenGate Veridata 12cR2 セットアップガイド
Oracle GoldenGate Veridata 12cR2 セットアップガイドOracle GoldenGate Veridata 12cR2 セットアップガイド
Oracle GoldenGate Veridata 12cR2 セットアップガイド
 
【GOJAS Meetup-10】Splunk:SmartStoreを使ってみた
【GOJAS Meetup-10】Splunk:SmartStoreを使ってみた【GOJAS Meetup-10】Splunk:SmartStoreを使ってみた
【GOJAS Meetup-10】Splunk:SmartStoreを使ってみた
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
オンプレミスからクラウドへ:Oracle Databaseの移行ベストプラクティスを解説 (Oracle Cloudウェビナーシリーズ: 2021年2月18日)
オンプレミスからクラウドへ:Oracle Databaseの移行ベストプラクティスを解説 (Oracle Cloudウェビナーシリーズ: 2021年2月18日)オンプレミスからクラウドへ:Oracle Databaseの移行ベストプラクティスを解説 (Oracle Cloudウェビナーシリーズ: 2021年2月18日)
オンプレミスからクラウドへ:Oracle Databaseの移行ベストプラクティスを解説 (Oracle Cloudウェビナーシリーズ: 2021年2月18日)
 
Apache Atlas: Governance for your Data
Apache Atlas: Governance for your DataApache Atlas: Governance for your Data
Apache Atlas: Governance for your Data
 

Similaire à Row/Column- Level Security in SQL for Apache Spark

Accumulo Summit 2016: Apache Accumulo on Docker with YARN Native Services
Accumulo Summit 2016: Apache Accumulo on Docker with YARN Native ServicesAccumulo Summit 2016: Apache Accumulo on Docker with YARN Native Services
Accumulo Summit 2016: Apache Accumulo on Docker with YARN Native Services
Accumulo Summit
 

Similaire à Row/Column- Level Security in SQL for Apache Spark (20)

Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDon't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
 
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object Stores
 
Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
Security Updates: More Seamless Access Controls with Apache Spark and Apache ...
Security Updates: More Seamless Access Controls with Apache Spark and Apache ...Security Updates: More Seamless Access Controls with Apache Spark and Apache ...
Security Updates: More Seamless Access Controls with Apache Spark and Apache ...
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
 
Spark Security
Spark SecuritySpark Security
Spark Security
 
Running Apache Zeppelin production
Running Apache Zeppelin productionRunning Apache Zeppelin production
Running Apache Zeppelin production
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in Enterprise
 
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
 
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisApache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
 
Accumulo Summit 2016: Apache Accumulo on Docker with YARN Native Services
Accumulo Summit 2016: Apache Accumulo on Docker with YARN Native ServicesAccumulo Summit 2016: Apache Accumulo on Docker with YARN Native Services
Accumulo Summit 2016: Apache Accumulo on Docker with YARN Native Services
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 

Plus de DataWorks Summit/Hadoop Summit

How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 

Plus de DataWorks Summit/Hadoop Summit (20)

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
 

Dernier

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Dernier (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Row/Column- Level Security in SQL for Apache Spark

  • 1. 1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Row/Column-level Security in SQL for Apache Spark Dongjoon Hyun – Software Engineer Bikas Saha – Software Engineer April 2017
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Who am I  Software Engineer @ Hortonworks  Apache REEF PMC member and committer  Apache Spark project contributor  https://github.com/dongjoon-hyun Dongjoon Hyun
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda Security Issues Goals Components How it works Demo
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Security  One of fundamental features for enterprise adoption – Multi-tenancy: Billing team / Data science team / Marketing teams  Row and column-level access control for SQL users – Row filtering – Column masking  Must enforce shared policies to various SQL engines simultaneously – E.g. Apache Spark 2.1/1.6 and Apache Hive 2.1
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Issue 1  Spark reads all or nothing – Directory/File-based permissions are insufficient  Permission 777 on warehouse? Security starts from storage
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Issue 2  Spark apps should be rewritten – Special data source tables  Duplicated data – Filtered rows – Removed or masked columns  SQL Views – Maintained by manually Overhead during starting and maintaining security policies
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Goals
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Goal 1: Spark SQL Apps Support row/column-level security with the batch apps from pyspark.sql import SparkSession spark = SparkSession .builder .enableHiveSupport() .getOrCreate() spark.sql("select * from db_common.t_customer").show() db_common t_customer …
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Goal 2: Spark shells (1/2) Support row/column-level security in all shells spark-shell pyspark
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Goal 2: Spark shells (2/2) Support row/column-level security in all shells sparkR spark-sql
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Goal 3: Spark Thrift Server Support row/column-level security with Spark Thrift Server Login as `hive` Login as `spark`
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Components
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What are required?  Kerberos  Apache Hadoop (HDFS/YARN)  Apache Ranger  Apache Hive (LLAP)  Spark-LLAP: A library and patches to integrate the above Focus here
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Ranger Provide a standard authorization method across many Hadoop components https://hortonworks.com/apache/ranger/#section_2
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive  Hive Ranger Plugin & Policies – Support row/column-level security  LLAP Daemon (GA in HDP 2.6) – Persistent query servers with intelligent in-memory caching – Provide a secure relational datanode view of the data Trusted Service
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark-LLAP for Spark 1.6 • User should use LlapContext • Support Scala/Java and spark-shell HDP 2.5 var lc = new LlapContext(sc) lc.sql("select * from t").show Spark-LLAP (Technical Preview) Milestone Spark-LLAP for Spark 2.1 • No need to rewrite SQL related code • Support all languages and shells HDP 2.6 Next Spark-LLAP for Spark 2.1 • Support YARN cluster mode
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark-LLAP GitHub (Apache License)
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved How it works
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved How it works – Overview Case: spark-submit with YARN cluster mode Spark Hive (HiveServer2) Ranger LLAP User Admin 2. Launch 3. Get delegation token 1. Manage policies 7. Monitor Audits 6. Read filtered/masked data Authorize 5. Get data locations 4. Get metadata
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved How it works – Overview Spark Hive (HiveServer2) Ranger LLAP User Admin 2. Launch 3. Get delegation token 1. Manage policies 7. Monitor Audits 6. Read filtered/masked data Authorize 5. Get data locations 4. Get metadata Existing InfraNew for Spark New for Hive (GA)
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hive Enable LLAP
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Admin – Manage Hive Database: db_common Table: * Hive Column: * Select User: spark Permissions: SELECT
  • 23. Admin: Audit
  • 24. User: Launch Spark jobs
    spark-submit --jars spark-llap.jar \
      --conf spark.sql.hive.llap=true \
      --conf spark.yarn.security.credentials.hiveserver2.enabled=true \
      --master yarn --deploy-mode cluster sql.py
    Callouts on the slide: "Easy to turn on/off" and "Only used for YARN cluster mode"
    Note: There are more static configurations related to LLAP. The `--packages` option is supported, too.
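Because the feature is switched on purely through configuration, a launcher script can toggle it without touching the application. The following is a small Python sketch of that idea, assuming a hypothetical helper named `build_spark_submit`; the flags and values are copied from the slide, and the helper only assembles the argument list without running it.

```python
import shlex
from typing import List


def build_spark_submit(app: str, llap_enabled: bool = True) -> List[str]:
    """Assemble the spark-submit argument list shown on the slide.

    `build_spark_submit` is a hypothetical helper for illustration; only the
    value of spark.sql.hive.llap changes between the "on" and "off" variants.
    """
    return [
        "spark-submit",
        "--jars", "spark-llap.jar",
        "--conf", f"spark.sql.hive.llap={str(llap_enabled).lower()}",
        "--conf", "spark.yarn.security.credentials.hiveserver2.enabled=true",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        app,
    ]


command = build_spark_submit("sql.py")
# Print a copy-pasteable shell command; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Keeping the arguments as a list rather than one string avoids shell-quoting problems if the command is later passed to `subprocess.run`.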
  • 25. Spark: Get delegation tokens
    - HDFS Delegation Token (existing)
      - HDFSCredentialProvider gets it from the NameNode
    - Hive Metastore Delegation Token (existing)
      - HiveCredentialProvider gets it from the Hive Metastore
    - HiveServer2 Delegation Token (Spark-LLAP)
      - HiveServer2CredentialProvider gets it from HiveServer2
    Note: Spark manages token renewal
  • 26. Spark: LlapMetastoreCatalog replaces MetastoreRelation with LlapRelation
    Query: SELECT gender, count(*) FROM db_common.t_customer WHERE name LIKE '%Obama' GROUP BY gender
    Parsed Logical Plan:
      Aggregate: gender
        Filter: name like %Obama
          UnresolvedRelation
    Analyzed Logical Plan:
      Aggregate: gender
        Filter: name like %Obama
          SubqueryAlias
            LlapRelation
  • 27. Spark: LlapMetastoreCatalog replaces MetastoreRelation with LlapRelation
    Side-by-side comparison: Without Spark-LLAP / With Spark-LLAP
  • 28. Spark: LlapRelation supports predicate pushdown during optimization
    Analyzed Logical Plan:
      Aggregate: gender
        Filter: name like %Obama
          SubqueryAlias
            LlapRelation
    Optimized Logical Plan:
      Aggregate: gender
        Project: gender
          Filter: EndsWith(name,Obama)
            LlapRelation
  • 29. Spark: LlapRelation supports predicate pushdown during optimization
    Same analyzed and optimized logical plans as on slide 28, plus the physical plan:
    Physical Plan:
      HashAggregate: gender …
        Project: gender
          Filter: EndsWith(name, Obama)
            Scan LlapRelation
              PushedFilter: StringEndsWith(name, Obama)
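The step from `name like %Obama` to `EndsWith(name, Obama)` and then to a pushed `StringEndsWith` filter can be sketched in plain Python. This illustrates the idea only and is not Spark's optimizer code: the function `simplify_like` and the tuple representation are hypothetical, and the sketch deliberately gives up on patterns that contain the `_` wildcard, an escape character, or a `%` in the middle.

```python
from typing import Optional, Tuple

# (filter name, column, literal), loosely modelled on Spark's data source filters
PushedFilter = Tuple[str, str, str]


def simplify_like(column: str, pattern: str) -> Optional[PushedFilter]:
    """Map a simple LIKE pattern to a string filter a data source can evaluate.

    Returns None when the pattern is too complex for this sketch, in which
    case the engine would have to evaluate the LIKE itself after the scan.
    """
    if "_" in pattern or "\\" in pattern:
        return None  # single-character wildcard or escape: not handled here
    literal = pattern.strip("%")
    if not literal or "%" in literal:
        return None  # only wildcards, or a wildcard in the middle
    leading = pattern.startswith("%")
    trailing = pattern.endswith("%")
    if leading and trailing:
        return ("StringContains", column, literal)
    if leading:
        return ("StringEndsWith", column, literal)
    if trailing:
        return ("StringStartsWith", column, literal)
    return ("EqualTo", column, literal)


# The predicate from the slide: WHERE name LIKE '%Obama'
print(simplify_like("name", "%Obama"))
```

Pushing the simplified filter into the scan lets the LLAP side skip non-matching rows before they are sent to Spark; the physical plan on the slide still keeps a `Filter` above the scan as well.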
  • 30. Spark: Read filtered and masked data from LLAP
    jobConf.set("hive.llap.zk.registry.user", "hive")
    jobConf.set("llap.if.hs2.connection", parameters("url"))
    jobConf.set("llap.if.query", queryString)
    …
    // Create Hadoop RDD and convert LLAP Row into Spark Row
    sc.sparkContext
      .hadoopRDD(…)
      .mapPartitionsWithInputSplit(…)
  • 31. Demo (Video)
  • 32. Some related SPARK issues
    - SPARK-14743 Add a configurable credential manager for Spark running on YARN
    - SPARK-15777 Catalog federation (Open)
    - SPARK-17767 Spark SQL ExternalCatalog API custom implementation support (Closed as Later)
    - SPARK-17819 Support default database in connection URIs for Spark Thrift Server
    - SPARK-18517 DROP TABLE IF EXISTS should not warn for non-existing tables
    - SPARK-18840 Avoid throw exception when getting token renewal interval in non HDFS security env.
    - SPARK-18857 Don't use `Iterator.duplicate` in STS
    - SPARK-19021 Generalize HDFSCredentialProvider to support non HDFS security filesystems
    - SPARK-19038 Avoid overwriting keytab configuration in yarn-client
    - SPARK-19179 Change spark.yarn.access.namenodes config and update docs
    - SPARK-19970 Table owner should be USER instead of PRINCIPAL
    - SPARK-19995 Register tokens to current UGI to avoid re-issuing of tokens in yarn client mode
  • 33. Summary
    - Supports row/column-level security with
      - Spark apps
      - Spark shells
      - Spark Thrift Server
    - You can use existing Spark 2.x SQL apps and scripts
    - Easy to turn on/off through configuration alone
    - Ranger enforces policies on Hive and Spark simultaneously and consistently
    Note: Spark-LLAP with HDP 2.6 is a Technical Preview (TP)
  • 34. Acknowledgement
    - Apache Hive / Apache Spark / Apache Ranger
    - Bikas Saha, Saisai Shao, Jason Dere, Thejas Nair, Zhan Zhang, and many others
  • 35. Thank you