As enterprises around the world bring more of their sensitive data into Hadoop data lakes, balancing democratized access to data against strong security principles becomes paramount. In this webinar, Srikanth Venkat, director of product management for security & governance, will demonstrate two new data protection capabilities in Apache Ranger: dynamic column masking and row-level filtering of data stored in Apache Hive. These features were introduced as part of the HDP 2.5 platform release.
The Ranger Admin portal is the central interface for security administration. Users can create and update policies, which are then stored in a policy database. Plugins within each component poll these policies at regular intervals. The portal also includes an audit server that sends audit data collected from the plugins to HDFS or a relational database for storage.
Ranger plugins: Plugins are lightweight Java programs that are embedded within the processes of each cluster component. For example, the Apache Ranger plugin for Apache Hive is embedded within HiveServer2. These plugins pull policies from a central server and store them locally in a file. When a user request comes through the component, the plugin intercepts the request and evaluates it against the security policy. Plugins also collect data from the user request and use a separate thread to send this data back to the audit server.
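The plugin behavior described above (pull policies, cache them in a local file, evaluate each request, and ship audit events on a separate thread) can be illustrated with a small toy model. This is only a sketch in Python, not the real Ranger plugin API, which is Java: the class name SketchPlugin, the policy dictionary layout, and the fetch_policies and audit_sink callables are all hypothetical.

```python
import json
import os
import queue
import tempfile
import threading


class SketchPlugin:
    """Toy model of a Ranger-style plugin (hypothetical, not the Ranger API)."""

    def __init__(self, fetch_policies, cache_path, audit_sink):
        self._fetch = fetch_policies    # hypothetical: returns a list of policy dicts
        self._cache_path = cache_path   # local file holding the last pulled policies
        self._audit_sink = audit_sink   # hypothetical: stores one audit event
        self._policies = []
        self._audit_queue = queue.Queue()
        # Audit events are drained on a separate thread, off the request path.
        threading.Thread(target=self._drain_audit, daemon=True).start()

    def refresh(self):
        """Pull policies from the central server and store them locally.
        A real plugin would call this at regular intervals."""
        policies = self._fetch()
        with open(self._cache_path, "w") as fh:
            json.dump(policies, fh)
        self._policies = policies

    def is_allowed(self, user, resource, action):
        """Evaluate a request against cached policies and queue an audit event."""
        allowed = any(
            p["resource"] == resource
            and action in p["actions"]
            and user in p["users"]
            for p in self._policies
        )
        self._audit_queue.put(
            {"user": user, "resource": resource, "action": action, "allowed": allowed}
        )
        return allowed

    def _drain_audit(self):
        while True:
            event = self._audit_queue.get()
            self._audit_sink(event)
            self._audit_queue.task_done()

    def flush_audit(self):
        """Block until every queued audit event has reached the sink."""
        self._audit_queue.join()


# Example wiring with an in-memory "server" and an in-memory audit store.
audit_log = []
cache_file = os.path.join(tempfile.mkdtemp(), "policies.json")
plugin = SketchPlugin(
    fetch_policies=lambda: [
        {"resource": "sales.orders", "actions": ["select"], "users": ["alice"]}
    ],
    cache_path=cache_file,
    audit_sink=audit_log.append,
)
plugin.refresh()
```

Because the policies are cached locally, a request can still be evaluated when the central server is briefly unreachable, and the audit queue keeps audit delivery from slowing down the request itself.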
User group sync: Apache Ranger provides a user synchronization utility to pull users and groups from Unix, LDAP, or Active Directory. The user and group information is stored within the Ranger portal and used for policy definition.
Row-level filtering simplifies applications running on Hive. Because the access restriction logic moves down into the Hive layer, Hive applies the restrictions every time data access is attempted. Query authors no longer need to add this logic to the query predicate, and row-level segmentation is enforced seamlessly behind the scenes.
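To make the idea concrete, here is a minimal sketch of row-level filtering as a query rewrite. It uses Python with an in-memory SQLite database as a stand-in for Hive. The ROW_FILTERS table, the group names, and the filtered_source helper are hypothetical and only illustrate the principle that the filter is attached to the table rather than written by the query author.

```python
import sqlite3

# Hypothetical policy: group name -> filter expression for the "customers" table.
ROW_FILTERS = {
    "eu_analysts": "region = 'EU'",
    "us_analysts": "region = 'US'",
}


def filtered_source(table, group):
    """Return a FROM-clause fragment that restricts `table` for `group`."""
    predicate = ROW_FILTERS.get(group)
    if predicate is None:
        return table  # no row filter applies to this group in the sketch
    return f"(SELECT * FROM {table} WHERE {predicate}) AS {table}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ana", "EU"), ("Bob", "US"), ("Chen", "EU")],
)


def run_as(group):
    """Run the same user query for a group; the query has no region predicate."""
    sql = f"SELECT name FROM {filtered_source('customers', group)} ORDER BY name"
    return [row[0] for row in conn.execute(sql)]


print(run_as("eu_analysts"))  # ['Ana', 'Chen']
print(run_as("us_analysts"))  # ['Bob']
```

The query text the user writes is identical in both calls; only the group membership changes which rows come back.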
Dynamic data masking via Apache Ranger lets security administrators ensure that authorized users see the data they are permitted to see, while for other users or groups the same data is masked or anonymized to protect sensitive content.
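A small sketch of how masking can depend on who is reading a column. The masking functions below are simplified illustrations loosely modeled on common mask styles (redact, show only the last four characters, hash). The function names, the MASK_POLICY mapping, and the group names are hypothetical and are not Ranger's implementation.

```python
import hashlib


def redact(value):
    """Replace letters with 'x' and digits with 'n', keeping other characters."""
    return "".join(
        "x" if c.isalpha() else "n" if c.isdigit() else c for c in value
    )


def show_last_4(value):
    """Mask everything except the last four characters."""
    return "x" * max(len(value) - 4, 0) + value[-4:]


def hash_value(value):
    """Replace the value with its SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical policy: (column, group) -> masking function.
MASK_POLICY = {
    ("ssn", "support"): show_last_4,
    ("email", "support"): redact,
}


def read_column(column, value, group):
    """Return the value as the given group is allowed to see it."""
    mask = MASK_POLICY.get((column, group))
    return value if mask is None else mask(value)


print(read_column("ssn", "123-45-6789", "support"))   # xxxxxxx6789
print(read_column("ssn", "123-45-6789", "auditors"))  # 123-45-6789
```

The stored data is never changed; the mask is applied at read time, so different groups can see different views of the same column.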
Interactive query: Low latency interactive query, persistent servers ready to process SQL
Intelligent in-memory caching
Builds on Hive engine + SQL capabilities
Long running processes
Ability to read from HDFS/S3, cache and serve it out
Open interfaces/composable interfaces to read data
Extensible interfaces that let Spark read data out of LLAP and process it
Rely on LLAP, which delivers trusted security
Client-side mechanisms can be circumvented, so we have focused on server-side enforcement of security
Spark has its own execution engine and SQL dialect, so it needs to be able to deal with data in a raw manner
Delegate all runtime and execution to Spark itself
Spark plugin called LLAP context (aware of the LLAP daemon, how to read data from it, and Ranger query transformations)
Spark SQL issues a query, which is routed to HiveServer2 and into Ranger; split locations are returned
Data is read in parallel based on the split locations and the assigned plan; Ranger applies query transformations to provide column masking and row filtering
Then Spark is free to
LLAP is a trusted daemon
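The flow in the notes above (a planning step that returns split locations, then parallel reads in which masking and filtering happen on the server side) can be sketched as a toy simulation. Everything here is hypothetical: SPLITS, plan_query, read_split, and client_query are stand-ins written in Python, not Spark, HiveServer2, LLAP, or Ranger APIs, and the sketch applies the same policy to every user.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data held by the trusted daemon, partitioned into splits (hypothetical).
SPLITS = {
    0: [{"name": "Ana", "region": "EU", "ssn": "111-11-1111"}],
    1: [
        {"name": "Bob", "region": "US", "ssn": "222-22-2222"},
        {"name": "Chen", "region": "EU", "ssn": "333-33-3333"},
    ],
}


def plan_query(user):
    """Stand-in for planning: return split ids plus the policy transformations.
    The same policy is returned for every user in this sketch."""
    transform = {
        "row_filter": lambda row: row["region"] == "EU",
        "mask": {"ssn": lambda v: "xxx-xx-" + v[-4:]},
    }
    return sorted(SPLITS), transform


def read_split(split_id, transform):
    """Stand-in for the daemon: filter and mask rows before they leave the server."""
    out = []
    for row in SPLITS[split_id]:
        if transform["row_filter"](row):
            masked = dict(row)
            for column, mask in transform["mask"].items():
                masked[column] = mask(masked[column])
            out.append(masked)
    return out


def client_query(user):
    """Stand-in for the client: plan once, then read all splits in parallel."""
    split_ids, transform = plan_query(user)
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda s: read_split(s, transform), split_ids))
    return [row for part in parts for row in part]


for row in client_query("analyst"):
    print(row)
```

The client only ever receives rows that have already been filtered and masked, which is the point of enforcing policy in the trusted daemon rather than on the client side.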