This document discusses security challenges with Hadoop and big data. It summarizes current security features in Hadoop like Kerberos authentication and access control lists. However, more work is still needed for enterprise-grade security, including policy management, fine-grained authorization, and auditing. The XA Secure solution aims to address these challenges with a centralized policy and governance layer across Hadoop components to enable features like data protection, delegated administration, and compliance reporting. XA Secure focuses on providing an end-to-end security suite for big data that is distribution-agnostic and can integrate with existing systems.
XA Secure | Whitepaper on data security within Hadoop
SECURE YOUR DATA IN HADOOP
Current state of security, and an approach for a comprehensive strategy
CONTENTS
Introduction
Big Data - What is happening?
Hadoop - Security
Current Hadoop Security Features/Initiatives
Work to be done in Hadoop
XA Secure - Big Data Security Approach
XA Secure differentiators
Summary
www.xasecure.com | +1.510.585.3289 | 7100 Stevenson Blvd, Fremont, CA 94538
INTRODUCTION
Big data is emerging as the next technology wave and enterprises across different
industries are adopting tools such as Hadoop. While there are efficiencies in processing
varied and distributed data, big data presents a unique challenge for managing
information security.
BIG DATA- WHAT IS HAPPENING?
Digital data is everywhere and global data is growing at 40% per year. Companies are
capturing trillions of bytes of information about their customers, suppliers, and
operations, and millions of networked sensors are being embedded in the physical
world in devices such as mobile phones, energy meters and automobiles, sensing,
creating, and communicating data. By collecting and analyzing all this information,
companies can gain insight into new business opportunities and threats. To harness the
ever-expanding data volumes, new technologies have emerged that process massive
data sets using a technique called massively parallel processing (MPP). A recent
survey by Talend found that 60% of companies looking at big data are considering
open source Apache Hadoop or Hadoop-based distributions.
From its initial development supporting Yahoo's growing search and web
management needs, Hadoop has emerged as the leading platform for big data
analytics applications. The Hadoop software market itself is predicted to reach
around $813 million by 2016 (IDC research). Enterprises are moving into a phase
where they have completed pilot or proof-of-concept work and are embracing
Hadoop to solve core business needs in production.
At the same time, organizations are trying to analyze many kinds of data, from
web logs and social media streams to sales and customer information, to get
better insights. With Hadoop, they are able to achieve this at a fraction of the
cost of traditional data warehouses. There is a movement toward creating large
data lakes or data hubs where enterprise-wide data can be stored and processed
using Hadoop.
This shift presents a data security risk, as data moves from the protected
walls of enterprise applications into the catch-all environment of big data.
Organizations need to provide a consistent level of security across the
enterprise, and data within big data initiatives is no exception.
HADOOP - SECURITY
Hadoop was developed to process massive amounts of disparate data using
commodity hardware. From its initial success at Yahoo, it has matured into a
platform supporting a range of industries. However, the security controls
inside Hadoop remain basic and are still evolving.
CURRENT HADOOP SECURITY FEATURES/INITIATIVES
Given the security challenges, a lot of work is under way within the open
source and vendor communities to make Hadoop a more secure environment. Some of
the important initiatives are summarized below.
Kerberos Authentication: As one of the first steps toward security, Kerberos
authentication was introduced into Hadoop in 2008 to add a basic level of
security that was missing before, and today it is the primary method for secure
authentication in Hadoop. Kerberos is a network authentication protocol that
uses "tickets" to let nodes communicating over a non-secure network prove their
identity to one another securely. With Kerberos enabled, Hadoop services such
as the NameNode and the MapReduce JobTracker can authenticate users and apply
permissions based on that identity.
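Real Kerberos involves a trusted Key Distribution Center (KDC), session keys,
and several exchanges; purely as an illustration of the ticket idea, the toy
sketch below has an authority sign a ticket with a key it shares with the
service, so the service can verify identity without trusting the network. This
simplifies away almost all of the actual protocol, and every name here is
illustrative.

```python
import hashlib
import hmac
import time

# Hypothetical key shared between the ticket authority and the service.
KDC_SERVICE_KEY = b"secret-shared-between-kdc-and-service"

def issue_ticket(principal: str, lifetime_s: int = 3600) -> dict:
    """Authority side: bind the principal and an expiry to an HMAC signature."""
    expires = int(time.time()) + lifetime_s
    payload = f"{principal}|{expires}".encode()
    sig = hmac.new(KDC_SERVICE_KEY, payload, hashlib.sha256).hexdigest()
    return {"principal": principal, "expires": expires, "sig": sig}

def verify_ticket(ticket: dict) -> bool:
    """Service side: recompute the signature and check the expiry."""
    payload = f"{ticket['principal']}|{ticket['expires']}".encode()
    expected = hmac.new(KDC_SERVICE_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, ticket["sig"]) and \
        ticket["expires"] > time.time()

ticket = issue_ticket("alice@EXAMPLE.COM")
print(verify_ticket(ticket))                            # a valid ticket verifies
forged = dict(ticket, principal="mallory@EXAMPLE.COM")  # tampered principal
print(verify_ticket(forged))                            # the forgery is rejected
```

The point of the sketch is only that possession of an unforgeable, expiring
ticket, not the network path, is what establishes identity.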
Access Control Lists (ACLs): In core HDFS, file permissions are similar to
permissions in a UNIX system. Read and write access is maintained per user and
group, where a group is simply a string of characters. At the MapReduce level,
MapReduce ACLs define which users and groups may submit jobs. User-group
mappings can be maintained within the Hadoop layer or retrieved from external
LDAP or Active Directory systems. HBase ACLs, introduced from HBase 0.92
onward, give the ability to define authorization policies
(Read/Write/Create/Admin) at table, column-family, or qualifier granularity for
a specified user.
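The UNIX-style check that HDFS applies can be sketched as follows. This is a
simplified model, not Hadoop's actual implementation, and all names are
illustrative: each file carries an owner, a group, and an octal mode, and the
first matching class (owner, then group, then other) decides access.

```python
from collections import namedtuple

# A file's security-relevant metadata: owner, group, and an octal mode
# such as 0o640 (owner rw-, group r--, other ---).
FileStatus = namedtuple("FileStatus", "owner group mode")

def hdfs_style_check(status: FileStatus, user: str,
                     user_groups: list, want: str) -> bool:
    """Return True if `user` may perform `want` ('r', 'w', or 'x')."""
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == status.owner:
        cls = (status.mode >> 6) & 7   # owner permission triple
    elif status.group in user_groups:
        cls = (status.mode >> 3) & 7   # group permission triple
    else:
        cls = status.mode & 7          # "other" permission triple
    return bool(cls & bit)

f = FileStatus(owner="alice", group="analysts", mode=0o640)
print(hdfs_style_check(f, "alice", [], "w"))          # owner may write
print(hdfs_style_check(f, "bob", ["analysts"], "r"))  # group member may read
print(hdfs_style_check(f, "carol", [], "r"))          # others may not read
```

The coarseness of this model, whole files and directories, one triple per
class, is exactly what motivates the fine-grained authorization discussed
later in this paper.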
Sentry (Cloudera): Cloudera recently introduced Sentry, a role-based
authorization framework that controls access for users and groups over Hive
and Cloudera's Impala. The framework uses a file-based policy provider and can
be configured at multiple levels, i.e., server, database, table, column, etc.
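The file-based provider can be illustrated with a short policy file of roughly
the following shape, mapping groups to roles and roles to scoped privilege
strings. The group, role, server, and table names here are hypothetical.

```ini
# Illustrative Sentry-style policy file; all names are hypothetical.
[groups]
# group (from LDAP or the OS) -> comma-separated list of roles
analysts = select_sales_role

[roles]
# role -> privilege scoped server -> database -> table -> action
select_sales_role = server=server1->db=sales->table=orders->action=select
```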
Project Knox (Hortonworks): Project Knox from Hortonworks is currently focused
on providing a gateway to Hadoop clusters: a single point of authentication and
access for Apache Hadoop services in a cluster. Planned features include
perimeter security for Hadoop, a single cluster endpoint for data and jobs, and
management of security across multiple clusters and Hadoop versions, among
other areas. The initiative, started in 2013, has already delivered a couple of
releases.
WORK TO BE DONE IN HADOOP
There is a long way to go before Hadoop can meet the exacting security
standards of large enterprises. Despite the current work, there are still
challenges for CIOs and CISOs adopting the Hadoop stack, including:
- No framework for managing enterprise policies. Large enterprises have complex
and constantly evolving policies for managing data access. The native Hadoop
framework does not offer an easy way to customize and manage these access
policies.
- Coarse-grained authorization. Current authorization lets users or user groups
access tables or file systems/directories only as a whole. Enterprises are
looking for finer-grained authorization that protects sensitive data from
access while still allowing the complete data set to be analyzed to its full
potential.
- Decentralizing data ownership. As the use of Hadoop expands in an
organization, business units still want to retain control of their data and
grant access to users from other units themselves.
- Lack of a uniform authorization method. While HBase uses ACLs to manage
authorization, HDFS refers to its own set of defined groups when vetting
access. Enterprises are looking for a universal authorization process across
all components.
- Lack of a universal audit mechanism. Currently each component has its own
audit tracking mechanism, with no uniformity in the elements tracked or the
format of the audit log. Enterprises are looking for an easy way to report the
access history of their users.
- Lack of reporting and governance capabilities. Enterprises need tools to
readily report policy status and access history, and to check compliance
conformance across assets.
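To make the authorization gaps above concrete, the sketch below models the
kind of single, column-aware policy store that all components (HDFS, Hive,
HBase) could consult, with a default-deny decision. The policy shape and all
names are illustrative inventions, not any product's actual format.

```python
# Hypothetical centralized policy store: each entry scopes a component,
# a resource, optionally a set of allowed columns, and the groups and
# actions it grants. All values below are illustrative.
POLICIES = [
    {"component": "hive", "resource": "sales.orders",
     "columns": {"order_id", "amount"},   # sensitive "ssn" column excluded
     "groups": {"analysts"}, "actions": {"select"}},
    {"component": "hdfs", "resource": "/data/raw",
     "columns": None,                     # file-level grant, no column scope
     "groups": {"etl"}, "actions": {"read", "write"}},
]

def is_allowed(component, resource, column, user_groups, action) -> bool:
    """Default-deny check consulted by every component."""
    for p in POLICIES:
        if (p["component"] == component and p["resource"] == resource
                and p["groups"] & set(user_groups)
                and action in p["actions"]
                and (p["columns"] is None or column in p["columns"])):
            return True
    return False

print(is_allowed("hive", "sales.orders", "amount", ["analysts"], "select"))
print(is_allowed("hive", "sales.orders", "ssn", ["analysts"], "select"))
```

Because every component asks the same store, audit logging and reporting can
also be centralized at the decision point, addressing the uniformity concerns
listed above.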
XA SECURE - BIG DATA SECURITY APPROACH
At XA Secure, we recognize these challenges for Hadoop and other big data
tools, and are working to solve them through our solution offerings. Our
initial product is built from the ground up for big data infrastructure. We
address the security challenges of the Hadoop infrastructure by providing a
governance layer that enables:
a) Centralized policy management, with the ability to define policies for
fine-grained access control to files (HDFS) and to column families and cells
(HBase, Hive), as well as differentiated views of data based on user function
b) Protection of sensitive data through masking and encryption
c) A common, extensive audit layer across Hadoop components; auditing can be
set at the resource and user-group level
d) Delegated administration of data
e) Policy analytics to monitor and report access and enable compliance
conformance
The tool currently covers the HBase, Hive and HDFS components, with other big
data tools, such as Greenplum and MongoDB, planned for future releases.
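As an illustration of the masking idea in (b) above, the sketch below redacts
or hashes sensitive columns according to a simple per-column policy before a
row reaches the user. The policy shape and column names are hypothetical and
do not represent the product's actual API.

```python
import hashlib

# Hypothetical masking policy: column name -> masking rule.
MASKING_POLICY = {"ssn": "redact", "email": "hash"}

def mask_row(row: dict) -> dict:
    """Apply the masking policy to one row before returning it to a user."""
    out = {}
    for col, value in row.items():
        rule = MASKING_POLICY.get(col)
        if rule == "redact":
            out[col] = "***"                 # hide the value entirely
        elif rule == "hash":
            # One-way hash: joins/grouping still work, raw value does not leak.
            out[col] = hashlib.sha256(value.encode()).hexdigest()[:12]
        else:
            out[col] = value                 # non-sensitive column, pass through
    return out

row = {"name": "Alice", "ssn": "123-45-6789", "email": "a@example.com"}
print(mask_row(row))
```

Redaction suits fields that analysts never need, while hashing preserves
equality so masked values can still be joined and counted.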
XA SECURE DIFFERENTIATORS
As noted before, a lot of work is being done to make Hadoop more secure, and
at XA Secure we continue to work with the open source community, leveraging
this collective work to deliver value to our customers. As a company with a
rich history in security and identity management, and a pure focus on big
data, we believe we bring a unique value proposition through our offerings,
which include:
a) An end-to-end access management and governance suite for Hadoop. We focus
on making it easy for both business users and administrators to manage data
security over Hadoop.
b) A distribution-agnostic solution. We support most of the prevalent Hadoop
distributions and integrate easily with the management tools that come with
each distribution.
c) Hooks to integrate with an enterprise's existing provisioning or access
management systems. We currently integrate with LDAP, and also support import
and export of our policies.
d) Industry-specific compliance and audit reports. We are building support for
government, financial and healthcare compliance requirements.
e) A design that leverages and builds on current open source efforts in
authentication and encryption. We will continue to embed other open source
initiatives as they are released.
SUMMARY
The big data ecosystem is evolving, and there are many initiatives in the open
source and vendor communities to build mature capabilities. It is important
that enterprises embed a security strategy into their plans early, thinking
through what data they will put into big data tools and how they will extend
security controls over that data. CISOs can adopt XA Secure's solution to
bring enterprise-level security and credibility to their big data initiatives.