For Impetus’ White Papers archive, visit: http://lf1.me/drb/
This white paper discusses the design considerations for enterprises that want to run Hadoop as a shared service for multiple departments.
As Hadoop becomes more mainstream and indispensable to enterprises, it is imperative that they build, operate and scale shared Hadoop clusters. The design considerations discussed in this paper will help enterprises accomplish the essential mission of running multi-tenant, multi-use Hadoop clusters at scale.
It covers Identity, Security, Resource Sharing, Monitoring and Operations on the central service.
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – Impetus White Paper
The Shared Elephant
A Shared Central Big Data Repository
www.impetus.com
Introduction
Running an Enterprise Big Data repository requires significant investment in
resources. A dedicated cluster for each department is cost-prohibitive, leading
to the creation of Big Data silos and underutilization of cluster resources.
Enterprises that run Hadoop at scale should allow Hadoop clusters to be
shared by different business units. They must also support multiple use cases
as well as a check-in/check-out model for blocks of analytic work. We cover
some design considerations for identity management, security, resource
sharing and monitoring that are essential to building a secure, robust, highly
available, shared central Big Data repository.
Identity
Security is of paramount concern in a shared, multi-tenant environment. Early
versions of Hadoop had rudimentary security features, essentially relying on a
fair use policy in a trusted environment. Recent versions of Hadoop have
added significant identity management features. Let us explore a couple of
these in detail.
Kerberos
Kerberos provides authentication services for the cluster. The Kerberos
mechanism offers stronger, more secure authentication than what was
available in earlier versions of Hadoop. All clients have to authenticate with a
central Kerberos service, the Key Distribution Center (KDC). Combined with
Hadoop’s authorization framework, this enables role-based access control and
privilege enforcement.
Kerberos enforces authentication of data node daemons with the parent
services (name node and job tracker). Authentication prevents rogue data
nodes from connecting to the parent services and compromising the data
stored in the cluster. (Refer to the figure below, which demonstrates how
Hadoop Kerberos authentication works.)

[Figure: Hadoop Kerberos Authentication — clients request a session ticket
and session key from the Authentication Service of the Kerberos Key
Distribution Center. Authenticated data nodes and task trackers serving
multiple tenants (HDFS layer and M/R layer) connect to the parent services
(name node and job tracker) in the Hadoop cluster.]
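In practice, turning on Kerberos starts with switching the cluster’s authentication mode in core-site.xml. The fragment below is a minimal sketch; a real deployment also needs per-daemon Kerberos principals and keytab files, which are omitted here.

```xml
<!-- core-site.xml: move from the default "simple" (trusted) mode to Kerberos -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<!-- also enable Hadoop's service-level authorization checks -->
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```

With these settings in place, every client and daemon must present valid Kerberos credentials before the name node or job tracker will serve it.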
Lightweight Directory Access Protocol (LDAP) Integration
LDAP can be used to provision user accounts consistently across all of the
data nodes. This enables fine-grained access control policies and helps
prevent privilege escalation attacks.
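Hadoop can also resolve group memberships directly from a directory server via its stock LdapGroupsMapping provider. The fragment below is an illustrative sketch; the server URL and search base are placeholder values.

```xml
<!-- core-site.xml: resolve user-to-group mappings from LDAP instead of the local OS -->
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ldap.example.com:389</value> <!-- placeholder directory server -->
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=example,dc=com</value> <!-- placeholder search base -->
</property>
```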
Security
Hadoop has several security features as listed below:
• Running data node daemons on privileged ports.
• Running tasks as the job owner instead of as the task tracker daemon user.
This prevents other users from tampering with the job and from viewing its
local task data.
• Preventing users other than the job owner from looking at map outputs.
• Restricting a task to communicating only with its parent task tracker, which
prevents rogue users from inspecting map input data.
Data Security
Hadoop does not natively integrate with data-at-rest encryption solutions.
However, the Intel distribution of Hadoop provides fast encryption using Intel
hardware enhancements. Hadoop 2.0 provides SSL transport between
Hadoop daemons and during the shuffle phase.
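In Hadoop 2, encrypting shuffle traffic is a configuration switch. The sketch below shows the relevant property, assuming SSL keystores and truststores have already been provisioned on every node:

```xml
<!-- mapred-site.xml: encrypt intermediate data moved during the shuffle phase -->
<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
</property>
```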
Sharing Resources
Allocating shared resources to different users and groups in a fair and efficient
manner poses some unique challenges in Hadoop. Hadoop does not provide
policies and SLAs that are typical of shared systems. Hadoop presents the
storage layer (HDFS) as a single shared resource but the computational layer
(MapReduce) requires some fine-tuning for optimal results. Nevertheless, here
are some recommendations on running a user-friendly shared Hadoop cluster.
Resource Usage Limits
• HDFS Quotas: HDFS provides name quotas (limits on the number of files
and directories) and space quotas (limits on the bytes consumed). Both are
very useful for enforcing sensible limits on HDFS usage. Designing a sensible
shared directory structure is important, since quotas are set at the directory
level. It is a good practice to have a common directory that is shared across
groups and separate quota-limited directories for each group in a shared
cluster.
• Task Slots: Task slots are configured on a per-node basis and should add
up to the total capacity of the cluster. Individual jobs are then monitored to
determine an appropriate number of mappers; setting it to a multiple of the
number of map slots is the recommended practice.
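Directory quotas are set with the dfsadmin tool. The commands below are an illustrative sketch against a hypothetical /shared/groupA directory; they require admin privileges on a live cluster, and older releases expect a raw byte count where newer ones accept unit suffixes such as 10t.

```shell
# Cap the group's directory at one million names (files + directories)
hadoop dfsadmin -setQuota 1000000 /shared/groupA

# Cap its raw storage footprint at 10 TB (counted after replication)
hadoop dfsadmin -setSpaceQuota 10t /shared/groupA

# Report current quota and usage for the directory
hadoop fs -count -q /shared/groupA
```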
Scheduling
Hadoop provides different schedulers as plug-ins. That said, not all schedulers
are created equal. The FIFO scheduler should not be used, as it can lead to
significant resource underutilization and job starvation. The fair scheduler is a
good option for a dedicated cluster but may lead to resource contention in a
shared environment. The capacity scheduler is the optimal choice for a shared
cluster: it provides multi-tenancy controls that prevent a user or a group of
users from overwhelming the cluster.
It also provides capacity guarantees through soft limits and enforceable hard
limits. The capacity scheduler additionally improves security by providing
ACLs for job queues.
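As an illustration, a Hadoop 1.x-style capacity-scheduler.xml can carve out guaranteed shares per tenant queue. The queue name and percentages below are hypothetical, and exact property names vary between Hadoop releases:

```xml
<!-- capacity-scheduler.xml (Hadoop 1.x style): a hypothetical "marketing" queue -->
<property>
  <name>mapred.capacity-scheduler.queue.marketing.capacity</name>
  <value>30</value> <!-- soft guarantee: 30% of cluster slots -->
</property>
<property>
  <name>mapred.capacity-scheduler.queue.marketing.maximum-capacity</name>
  <value>50</value> <!-- hard limit: at most 50%, even when the cluster is idle -->
</property>
```

The soft limit lets the queue borrow idle capacity from other tenants, while the hard limit bounds how much of the cluster any one tenant can ever consume.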
Monitoring
Hadoop provides good monitoring options. We recommend using Ganglia or
similar monitoring for production clusters. JMX monitoring should also be
enabled. Recent versions of Hadoop ship with the more flexible metrics2
framework for metrics collection. Using metrics2 in the Ganglia context
provides valuable insight into cluster usage. Oozie workflows also enables
SLA tracking, which is important for a shared cluster.
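Wiring metrics2 into Ganglia is done in hadoop-metrics2.properties. The fragment below is a minimal sketch, with the gmond endpoint as a placeholder:

```properties
# hadoop-metrics2.properties: send daemon metrics to Ganglia every 10 seconds
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.period=10
# placeholder gmond endpoint; add a line per daemon to be monitored
namenode.sink.ganglia.servers=gmond.example.com:8649
jobtracker.sink.ganglia.servers=gmond.example.com:8649
```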
Operations
We have discussed several operational considerations such as security,
optimal resource sharing and monitoring. In addition to these, the operations
team needs to build a proactive ‘service’ approach that addresses the full
range of service components present in a Hadoop environment. Each of these
components is a potential point of failure. Operations needs to shift from
passive monitoring to actively meeting SLAs in a new distributed environment.
This shift in focus necessitates a new organizational culture in addition to
operational excellence.
Operational Excellence
Operational excellence for a shared cluster is not just about cluster health and
uptime. Service metrics such as job completion rate, resource sharing and
SLA attainment are also significant. It is important to operationalize the
aspects of identity, security, resource sharing and monitoring discussed above.
To accomplish this, Hadoop operations teams need to perform regular audits
and fire drills, and to ensure well-documented processes and procedures. A
runbook-based troubleshooting guide and well-formulated support levels
(Level 1, Level 2, and Level 3) with an easy escalation procedure are also
required. If SLAs mandate limited service interruption, then the runbooks
should specify maximum resolution times and mandatory escalation based on
severity and elapsed time. Operational excellence is a function of all of the above.