eBay has large, multi-purpose Hadoop clusters holding many petabytes of data. eBay Inc.'s many subsidiaries and applications give rise to fascinating, complex scenarios for authentication, audit, access control, data protection, data safety, and privacy. To fulfill these requirements, security is enabled on the Hadoop clusters at eBay. We have been experimenting with, and implementing in our production systems, several techniques to 1) set up and operate large clusters (thousands of nodes) efficiently and effectively for a thousand users, and 2) enable security transparently, without impacting user jobs or over-burdening users with inconvenient restrictions. Based on our experiments, we have assembled some effective "best" practices for deploying, operating, and using secure Hadoop clusters. In this presentation, we explain methods and rules of thumb that efficiently and effectively: 1) set up reliable, scalable infrastructure to enable strong security for large clusters; 2) set up a very large secure Hadoop cluster in a scalable, zero-touch, automated way; 3) update software and configuration on the cluster efficiently while minimizing "drift" among machines; 4) keep Hadoop services up and running with minimum human intervention; 5) manage user access to Hadoop clusters; 6) control access to data and other resources; and 7) process highly confidential data using map-reduce programs.
3. Cluster Facts
•Shared clusters and dedicated clusters
•Tens of PB and tens of thousands of slots per cluster
•Runs HDP 1.2
•Used primarily for analysis of user behavior and inventory
•Mix of production jobs and ad-hoc jobs
•Mix of MR, Hive, Pig, Cascading, Streaming, etc.
Secure Hadoop @ eBay 3
4. Why is Security needed at eBay?
•To control access to sensitive data
– ACLs are ineffective without strong authentication
•To execute tasks as the job submitter
•To build new features
– Encryption
5. Hadoop Security Overview
•Authentication using Kerberos
•Authorization via ACLs
•Group and user information from LDAP
•Pluggable authentication for the web UI
6. Security Infrastructure @ eBay
•Cluster machines, including the Gateway, are inside the firewall
•Uses Active Directory for Kerberos and LDAP
•Separate domains for users and Hadoop servers
[Diagram: Gateway and cluster nodes (JT, NN, HBM, DN, TT, RS) inside the firewall, authenticating against the CORP AD and a dedicated Hadoop AD]
7. Advantages of Separate User and Server Domains
•Separates user and server authentication
•Prevents additional Kerberos and LDAP traffic to corporate servers
•The Hadoop team can manage the Hadoop server accounts
[Diagram: user accounts live in the CORP AD; server accounts live in the Hadoop AD; Hadoop cluster nodes authenticate against both]
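The two-realm layout above might look like the following in a client's krb5.conf. The realm names and KDC hostnames here are illustrative assumptions, not eBay's actual configuration:

```
[realms]
    CORP.EBAY.COM = {
        kdc = corp-ad.example.com       # illustrative KDC host
    }
    HADOOP.EBAY.COM = {
        kdc = hadoop-ad.example.com     # illustrative KDC host
    }

# Note: no [capaths] section and no cross-realm krbtgt principals --
# the two realms deliberately do not trust each other (see slide 9).
```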
8. Syncing Hadoop User Information
•All nodes require user and group information
– Permission checks
– Running tasks
•Hadoop AD should contain user and group information
•Periodic synchronization of user information from CORP AD to Hadoop AD
– LDAP Synchronization Connector (LSC)
– Users' passwords are not synced
[Diagram: LSC periodically copies Hadoop users, Hadoop groups, and batch accounts from the CORP AD to the Hadoop AD; the Gateway and cluster nodes (JT, NN, DN, TT) query the Hadoop AD]
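The "password is not synced" rule can be illustrated with a small filter that drops password attributes from an LDIF export before it is loaded into the Hadoop AD. This is a minimal sketch, not the actual LSC configuration; the attribute names and the sample entry are made up for illustration:

```shell
#!/bin/sh
# strip_passwords: remove password attributes from an LDIF stream so that
# only user/group identity data is replicated to the Hadoop domain.
strip_passwords() {
    grep -iv -e '^userPassword:' -e '^unicodePwd:'
}

# Hypothetical user entry piped through the filter; the userPassword
# line is dropped, everything else passes through unchanged.
strip_passwords <<'EOF'
dn: cn=jdoe,ou=users,dc=corp,dc=example,dc=com
cn: jdoe
uid: jdoe
gidNumber: 5000
userPassword: {SSHA}secret-hash-here
EOF
```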
9. No Cross-Domain Trust!
•Modified the Hadoop authentication layer
– Hadoop masters have two principals and corresponding keytabs
•hdfs/namenode@hadoop.ebay.com
•hdfs/namenode@corp.ebay.com
– The server loads its principal and key based on the client
– Requires changes in the Hadoop, HBase, and ZooKeeper servers
[Diagram: DN/TT obtain a service ticket for hdfs/nn from the Hadoop AD, while users obtain one from the CORP AD; the NN holds both the hdfs/nn@hadoop and hdfs/nn@corp principals]
10. User Authentication - Obtaining Tickets
•Ad-hoc jobs/queries are run using personal accounts
– A PAM module fetches tickets at login
– kinit when tickets expire
•Production jobs are run using batch accounts
– Use keytabs to obtain tickets
– Automatic ticket renewal using k5start
– Enabled transparent security rollout
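For batch accounts, the renewal loop can be as simple as running k5start against the account's keytab. The keytab path and cache location below are illustrative assumptions, not eBay's actual setup:

```
# Keep a fresh TGT for the batch account: -b runs in the background,
# -K 60 re-authenticates every 60 minutes, -f points at the keytab,
# -U takes the client principal from the keytab, -k names the cache.
# (Paths are hypothetical.)
k5start -b -K 60 -f /etc/security/keytabs/batch.keytab -U -k /tmp/krb5cc_batch
```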
11. Encrypting Sensitive Data
•Use case
– Users copy encrypted data to the cluster
– Key identifiers are passed during job submission
– The job client fetches keys from the key store using the user's credentials
– Key values are protected using the cluster's public key
•Work in progress
[Diagram: the Job Client reads secrets from the Key Store, then submits the job together with secrets S to the Hadoop Cluster]
12. Direct Access to the Cluster
•Current cluster access is through the Gateway machine
•Direct access to the cluster from desktops
– The communication should be encrypted
– Communication inside the firewall need not be encrypted
•Advantages
– Increases user productivity
– Reduces utilization of the Gateway
[Diagram: desktops reach the Hadoop Cluster directly with authentication and privacy, or ssh to the Gateway with authentication only]
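Wire encryption for direct client access can be enabled with Hadoop's standard RPC protection setting. This core-site.xml fragment is a generic illustration, not eBay's configuration:

```xml
<!-- core-site.xml: SASL quality of protection for client/server RPC.
     Valid values: authentication, integrity, privacy (encrypted). -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>
```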
13. Summary
•Infrastructure using Active Directory and separate domains
•Authentication across domains without domain trust
•Rollout with minimal disruption
•Additional security features
15. Why?
•Daemons die from time to time
– We don’t know about it
– Would be nice if we could do something about it in a smart way
•There are different ways to control daemons
– Not portable
– Changes with platform
– Some init scripts are not well-written
– Some ways require sudo
– Caller’s environment can affect how daemon runs
– Some ways don’t handle automatic restarts
Enter process supervision!
16. What?
•daemontools-encore: a uniform mechanism to control daemons
– Simple command set: svc, svstat, svup, svok
– Supports process state change callback (notify script)
•Alert when a daemon crashes
•Smart restarts (don't restart if thrashing)
– Can be used for one-shot jobs (svc -o)
– Portable, runs on many UNIX versions
– Robust and reliable code (small is beautiful)
– Includes configurable log management
•multilog manages stdout, stderr output
•Never fill up your disks
•Multiple log queues possible (e.g. everything, errors only)
18. Configuring A Service
•A service consists of a directory: /var/lib/service/foo
•Holds some files and directories:
– start (optional)
– run
– notify (optional)
– stop (optional)
– log/run
– log/main
– env
•To enable a service, put a symlink to it in /service, and svscan will start it:
– ln -s /var/lib/service/foo /service/foo
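The layout above can be sketched as a small script that builds a service directory with its run scripts. The paths and the stand-in daemon command are illustrative; a real deployment would use /var/lib/service and symlink into /service so svscan picks the service up:

```shell
#!/bin/sh
set -e

# Build a service directory skeleton in a scratch location
# (a real setup would use /var/lib/service/foo).
SVC_DIR=$(mktemp -d)/foo
mkdir -p "$SVC_DIR/log" "$SVC_DIR/env"

# run: what supervise executes and restarts; 'sleep' stands in for a daemon.
cat > "$SVC_DIR/run" <<'EOF'
#!/bin/sh
exec 2>&1
exec envdir env sleep 3600
EOF
chmod +x "$SVC_DIR/run"

# log/run: the logger process fed by run's stdout/stderr.
cat > "$SVC_DIR/log/run" <<'EOF'
#!/bin/sh
exec multilog s10485760 n500 ./main
EOF
chmod +x "$SVC_DIR/log/run"

# Enabling the service would then be:  ln -s "$SVC_DIR" /service/foo
echo "$SVC_DIR"
```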
19. Sample run Scripts
/service/tasktracker/run
#!/bin/sh
exec 2>&1
# Give the hadoop user access
setfacl -R -m u:hadoop:rwx supervise
exec envdir env setuidgid hadoop /apache/hadoop/bin/hadoop tasktracker
/service/tasktracker/log/run
#!/bin/sh
# Give the hadoop user access
setfacl -R -m u:hadoop:rwx supervise
test -d main || install -o hadoop -d main
exec setuidgid hadoop multilog s10485760 n500 ./main
20. The env directory
# pwd
/service/datanode/env
# head *
==> HADOOP_DATANODE_OPTS <==
-Dhadoop.log.file.RFA.MaxBackupIndex=500 -Dhadoop.log.file.RFA.MaxFileSize=100MB
==> HADOOP_HOME <==
/apache/hadoop
==> HADOOP_LOG_DIR <==
/apache/hadoop/logs
==> HADOOP_LOGFILE <==
hadoop-hadoop-datanode.log
==> HADOOP_ROOT_LOGGER <==
INFO,RFA
==> HADOOP_SECURE_DN_USER <==
hadoop
#
Can use echo and rm to edit values!
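Because each envdir value is a one-line file, editing is plain shell. The following is a self-contained illustration in a scratch directory rather than the live /service/datanode/env:

```shell
#!/bin/sh
set -e

# Work in a scratch env directory (the real one would be /service/datanode/env).
ENV_DIR=$(mktemp -d)
cd "$ENV_DIR"

# Set a value: echo writes the single-line file that envdir reads.
echo "/apache/hadoop" > HADOOP_HOME

# Change a value: just overwrite the file.
echo "INFO,RFA" > HADOOP_ROOT_LOGGER

# Unset a value: rm removes the variable entirely.
rm HADOOP_ROOT_LOGGER

head -1 HADOOP_HOME   # -> /apache/hadoop
```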
21. Service State Commands
# svstat /service/*
/service/datanode: up (pid 31690) 2774877 seconds, running
/service/gmon: up (pid 24474) 41500 seconds, running
/service/hbase-regionserver: up (pid 3221) 6475035 seconds, running
/service/puppet: up (pid 11917) 2246936 seconds, running
/service/tasktracker: up (pid 28229) 2757029 seconds, running
# svc -t /service/datanode
# sleep 10
# svstat /service/datanode
/service/datanode: up (pid 8203) 10 seconds, running
# svc -d /service/datanode
# svstat /service/datanode
/service/datanode: down 6 seconds, normally up, stopped
# svc -u /service/datanode
# sleep 10
# svstat /service/datanode
/service/datanode: up (pid 9582) 10 seconds, running
#