MapReduce over Tahoe: A Least-Authority Encrypted Distributed Filesystem
1. MapReduce over Tahoe
Aaron Cordova
Associate, New York
Oct 1, 2009
Booz Allen Hamilton Inc.
134 National Business Parkway
Annapolis Junction, MD 20701
cordova_aaron@bah.com
Hadoop World NYC 2009
2. MapReduce over Tahoe
Impact of data security requirements on large scale analysis
Introduction to Tahoe
Integrating Tahoe with Hadoop’s MapReduce
Deployment scenarios, considerations
Test results
3. Features of Large Scale Analysis
As data grows, it becomes harder and more expensive to move
– “Massive” data
The more datasets are co-located, the more valuable each becomes
– Network Effect
Bring computation to the data
4. Data Security and Large Scale Analysis
Each department within an organization has its own data
Some data need to be shared
Others are protected
(Diagram: departments such as CRM, Product, Sales, Testing)
5. Data Security
Because of security constraints, departments tend to set up their own data storage and processing systems independently
This includes support staff
Highly inefficient
Analysis across datasets is impossible
(Diagram: each department duplicates its own Support, Storage, Processing, and Apps stack)
7. Tahoe - A Least Authority File System
Release 1.5
AllMyData.com
Included in Ubuntu Karmic Koala
Open Source
8. Tahoe Architecture
Data originates at the client, which is trusted
The client encrypts, segments, and erasure-codes the data
Segments are distributed to storage nodes over encrypted (SSL) links
Storage nodes see only encrypted data, and are not trusted
(Diagram: Client connecting over SSL to Storage Servers)
9. Tahoe Architecture Features
AES Encryption
Segmentation
Erasure-coding
Distributed
Flexible Access Control
10. Erasure Coding Overview
(Diagram: a file encoded into n shares, any k of which suffice)
Only k of n segments are needed to recover the file
Up to n-k machines can fail, be compromised, or be malicious without data loss
n and k are configurable, and can be chosen to achieve the desired availability
The expansion factor of the data is n/k (the default is 3-of-10, an expansion of about 3.3)
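The k-of-n math above is easy to check numerically. The following is an illustrative stand-alone Java sketch (not part of Tahoe); the 90% per-node uptime figure is an assumption chosen for the example:

```java
public class ErasureMath {
    // Binomial coefficient C(n, k), computed iteratively in doubles.
    static double choose(int n, int k) {
        double r = 1.0;
        for (int i = 1; i <= k; i++) r = r * (n - k + i) / i;
        return r;
    }

    // Probability that at least k of n independent storage nodes survive,
    // each surviving with probability p: sum of Binomial(n, p) tail terms.
    static double availability(int n, int k, double p) {
        double sum = 0.0;
        for (int i = k; i <= n; i++)
            sum += choose(n, i) * Math.pow(p, i) * Math.pow(1 - p, n - i);
        return sum;
    }

    public static void main(String[] args) {
        int n = 10, k = 3; // the slide's default 3-of-10 encoding
        // Storage expansion: the file is stored as n/k times its size.
        System.out.printf("expansion = %.2f\n", (double) n / k);
        // Assumed per-node uptime of 0.9: any 3 of 10 shares recover the file.
        System.out.printf("availability = %.7f\n", availability(n, k, 0.9));
    }
}
```

Even with unreliable nodes, the file is almost always recoverable, which is the point of choosing n and k to hit an availability target.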
11. Flexible Access Control
Each file has a Read Capability and a Write Capability
These capabilities are decryption keys
Directories have capabilities too
(Diagram: File and Dir, each with a ReadCap and a WriteCap)
12. Flexible Access Control
Access to a subset of files can be granted by:
– creating a directory
– attaching files
– sharing the read or write capability of the directory
Any files or directories attached are accessible
Any outside the directory are not
(Diagram: a Dir whose ReadCap grants access to the Files attached beneath it)
13. Access Control Example
(Diagram: files linked under the directories /Sales and /Testing)
Each department can access its own files
15. Access Control Example
(Diagram: files linked under the directories /Sales, /New Products, and /Testing)
Files that need to be shared can be linked into a new directory, whose read capability is given to both departments
16. Hadoop Can Use The Following File Systems
HDFS
Cloud Store (KFS)
Amazon S3
FTP
Read-only HTTP
Now, Tahoe!
17. Hadoop File System Integration HowTo
Step 1.
– Locate your favorite file system’s API
Step 2.
– subclass FileSystem
– found in /src/core/org/apache/hadoop/fs/FileSystem.java
Step 3.
– Add a property to core-site.xml:
  <property>
    <name>fs.lafs.impl</name>
    <value>your.class</value>
  </property>
Step 4.
– Test using your favorite Infrastructure Service Provider
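The mechanism behind Steps 2–3 is that Hadoop maps a URI scheme to an implementation class through a `fs.<scheme>.impl` configuration key. A minimal self-contained Java sketch of that lookup pattern (this resolver class is hypothetical, not Hadoop's actual code; the `org.apache.hadoop.fs.lafs.LAFS` class name is the one used later in the deck):

```java
import java.util.HashMap;
import java.util.Map;

public class FsResolver {
    // Stand-in for Hadoop's Configuration key/value store.
    private final Map<String, String> conf = new HashMap<>();

    public void set(String key, String value) { conf.put(key, value); }

    // Resolve the implementation class name registered for a URI scheme,
    // the same "fs.<scheme>.impl" convention core-site.xml drives.
    public String implFor(String scheme) {
        String impl = conf.get("fs." + scheme + ".impl");
        if (impl == null)
            throw new IllegalArgumentException("No FileSystem for scheme: " + scheme);
        return impl;
    }

    public static void main(String[] args) {
        FsResolver r = new FsResolver();
        r.set("fs.lafs.impl", "org.apache.hadoop.fs.lafs.LAFS");
        System.out.println(r.implFor("lafs"));
    }
}
```

In real Hadoop, the resolved class name is then instantiated by reflection and must be a subclass of `org.apache.hadoop.fs.FileSystem`, which is why Step 2 says to subclass it.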
18. Hadoop Integration : MapReduce
One Tahoe client runs on each machine that serves as a MapReduce worker
On average, clients communicate with k storage servers
Jobs are limited by aggregate network bandwidth
MapReduce workers are trusted; storage nodes are not
(Diagram: Hadoop MapReduce workers communicating with the Storage Servers)
19. Hadoop-Tahoe Configuration
Step 1. Start Tahoe
Step 2. Create a new directory in Tahoe, note the WriteCap
Step 3. Configure core-site.xml thus:
– fs.lafs.impl: org.apache.hadoop.fs.lafs.LAFS
– lafs.rootcap: $WRITE_CAP
– fs.default.name: lafs://localhost
Step 4. Start MapReduce, but not HDFS
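Put together, the three settings from Step 3 might look like this in core-site.xml (the `$WRITE_CAP` placeholder stands for the WriteCap noted in Step 2):

```xml
<configuration>
  <property>
    <name>fs.lafs.impl</name>
    <value>org.apache.hadoop.fs.lafs.LAFS</value>
  </property>
  <property>
    <name>lafs.rootcap</name>
    <value>$WRITE_CAP</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>lafs://localhost</value>
  </property>
</configuration>
```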
20. Deployment Scenario - Large Organization
Within a datacenter, departments can run MapReduce jobs on discrete groups of compute nodes
Each MapReduce job accesses a directory containing a subset of files
Results are written back to the storage servers, encrypted
(Diagram: Sales and Audit groups of MapReduce workers / Tahoe clients sharing the Storage Servers)
21. Deployment Scenario - Community
If a community uses a shared data center, different organizations can run discrete MapReduce jobs
Perhaps most importantly, when results are deemed appropriate to share, access can be granted simply by sending a read or write capability
Since the data are all co-located already, no data needs to be moved
(Diagram: organizations such as the FBI and Homeland Security running separate MapReduce workers / Tahoe clients against the shared Storage Servers)
22. Deployment Scenario - Public Cloud Services
Since storage nodes require no trust, they can be located remotely, e.g. within a cloud service provider's datacenter
MapReduce jobs can be run this way if bandwidth to the datacenter is adequate
(Diagram: local MapReduce workers / Tahoe clients reaching Storage Servers hosted by a cloud service provider)
23. Deployment Scenario - Public Cloud Services
For some users, everything could be run remotely in a service provider's data center
There are a few caveats and additional precautions in this scenario:
(Diagram: both the MapReduce workers / Tahoe clients and the Storage Servers hosted by the cloud service provider)
24. Public Cloud Deployment Considerations
Store configuration files in memory
Encrypt / disable swap
Encrypt spillover
Must trust memory / hypervisor
Trust service provider disks
25. HDFS and Linux Disk Encryption Drawbacks
At most one key per node - no support for flexible access control
Decryption done at the storage node rather than at the client - still have to trust storage nodes
27. Performance: HDFS vs. Tahoe
Tests run on ten nodes
RandomWrite writes 1 GB per node
WordCount done over randomly generated text
Tahoe write speed is 10x slower; read-intensive jobs are about the same
Not so bad, since the most common data use case is write-once, read-many
(Chart: HDFS vs. Tahoe run times for RandomWrite and WordCount)
28. Code
Tahoe available from http://allmydata.org
– Licensed under GPL 2 or TGPPL
Integration code available at http://hadoop-lafs.googlecode.com
– Licensed under Apache 2