SlideShare a Scribd company logo
1 of 28
Download to read offline
MapReduce over Tahoe
                     Aaron Cordova
                                Associate       New York
                                               Oct 1, 2009


                    Booz Allen Hamilton Inc.             .
              134 National Business Parkway
               Annapolis Junction, MD 20701
                   cordova_aaron@bah.com




Hadoop World 2009 2009
 Hadoop World NYC
                                                             1
MapReduce over Tahoe
  Impact of data security requirements on large scale analysis

  Introduction to Tahoe

  Integrating Tahoe with Hadoop’s MapReduce

  Deployment scenarios, considerations

  Test results




Hadoop World NYC 2009
                                                                  2
Features of Large Scale Analysis
  As data grows, it becomes harder, more expensive to move
   – “Massive” data

  The more data sets are located together, the more valuable each is
   – Network Effect

  Bring computation to the data




Hadoop World NYC 2009
                                                                        3
Data Security and Large Scale Analysis
  Each department within an organization has its own data

  Some data need to be shared

  Others are protected
                                                             CRM


                                                      Product
                                                                Sales
                                                      Testing




Hadoop World NYC 2009
                                                                        4
Data Security
  Because of security constraints,
   departments tend to setup
   their own data storage and
   processing systems
   independently                       Support      Support      Support      Support
  This includes support staff
                                       Storage      Storage      Storage      Storage
  Highly inefficient
                                      Processing   Processing   Processing   Processing
  Analysis across datasets is
   impossible                           Apps         Apps         Apps         Apps




Hadoop World NYC 2009
                                                                                          5
“Stovepipe Effect”




Hadoop World NYC 2009
                        6
Tahoe - A Least Authority File System
  Release 1.5

  AllMyData.com

  Included in Ubuntu Karmic Koala

  Open Source




Hadoop World NYC 2009
                                        7
Tahoe Architecture
  Data originates at the client, which
   is trusted                                   Storage Servers
  Client encrypts, segments, and
   erasure-codes data

  Segments are distributed to
   storage nodes over encrypted
   links

  Storage nodes only see encrypted
                                          SSL
   data, and are not trusted

                                           Client




Hadoop World NYC 2009
                                                                  8
Tahoe Architecture Features
  AES Encryption

  Segmentation

  Erasure-coding

  Distributed

  Flexible Access Control




Hadoop World NYC 2009
                              9
Erasure Coding Overview

                                                 N




           K
  Only k of n segments are needed to recover the file

  Up to n-k machines can fail, be compromised, or malicious without data loss

  n and k are configurable, and can be chosen to achieve desired availability

  Expansion factor of data is k/n (default is 3/10, or 3.3)



Hadoop World NYC 2009
                                                                                 10
Flexible Access Control
  Each file has a Read Capability and a Write Capability

  These are decryption keys                                ReadCap
                                                    File
  Directories have capabilities too
                                                            WriteCap

                                                            ReadCap
                                                    Dir
                                                            WriteCap




Hadoop World NYC 2009
                                                                       11
Flexible Access Control
  Access to a subset of files can be done by:
   – creating a directory                                    Dir
   – attaching files
   – sharing read or write capabilities of the dir

  Any files or directories attached are accessible

  Any outside the directory are not                  File          Dir   ReadCap




                                                             File         File




Hadoop World NYC 2009
                                                                                    12
Access Control Example

 Files




Directories         /Sales                            /Testing

         Each department can access their own files


Hadoop World NYC 2009
                                                                 13
Access Control Example

 Files




Directories         /Sales                            /Testing

         Each department can access their own files


Hadoop World NYC 2009
                                                                 14
Access Control Example

 Files




Directories      /Sales             /New             /Testing
                                  Products
   Files that need to be shared can be linked to a new directory, whose read
                     capability is given to both departments

Hadoop World NYC 2009
                                                                               15
Hadoop Can Use The Following File Systems
  HDFS

  Cloud Store (KFS)

  Amazon S3

  FTP

  Read only HTTP

  Now, Tahoe!




Hadoop World NYC 2009
                                            16
Hadoop File System Integration HowTo
  Step 1.
   – Locate your favorite file system’s API

  Step 2.
   – subclass FileSystem
   – found in /src/core/org/apache/hadoop/fs/FileSystem.java

  Step 3.
   – Add lines to core-site.xml:
                          <name> fs.lafs.impl </name>
                          <value> your.class </value>

  Step 4.
   – Test using your favorite Infrastructure Service Provider




Hadoop World NYC 2009
                                                                17
Hadoop Integration : MapReduce
  One Tahoe client is run on each         Storage Servers
   machine that serves as a
   MapReduce Worker

  On average, clients
   communicate with k storage
   servers

  Jobs are limited by aggregate
   network bandwidth

  MapReduce workers are trusted,
   storage nodes are not             Hadoop Map Reduce Workers




Hadoop World NYC 2009
                                                                 18
Hadoop-Tahoe Configuration
  Step 1. Start Tahoe

  Step 2. Create a new directory in Tahoe, note the WriteCap

  Step 3. Configure core-site.xml thus:
   – fs.lafs.impl: org.apache.hadoop.fs.lafs.LAFS
   – lafs.rootcap: $WRITE_CAP
   – fs.default.name: lafs://localhost

  Step 4. Start MapReduce, but not HDFS




Hadoop World NYC 2009
                                                                19
Deployment Scenario - Large Organization
  Within a datacenter,                        Storage Servers
   departments can run
   MapReduce jobs on discrete
   groups of compute nodes

  Each MapReduce job accesses
   a directory containing a subset
   of files

  Results are written back to the
   storage servers, encrypted

                                            Sales            Audit

                                     MapReduce Workers / Tahoe Clients


Hadoop World NYC 2009
                                                                         20
Deployment Scenario - Community
  If a community uses a shared                   Storage Servers
   data center, different
   organizations can run discrete
   MapReduce jobs

  Perhaps most importantly, when
   results are deemed appropriate
   to share, access can be granted
   simply by sending a read or
   write capability

  Since the data are all co-located
   already, no data needs to be
   moved
                                            FBI         Homeland Sec

                                       MapReduce Workers / Tahoe Clients


Hadoop World NYC 2009
                                                                           21
Deployment Scenario - Public Cloud Services
  Since storage nodes require no              Storage Servers
   trust, they can be located at a
   remote location, e.g. within a
   cloud service provider’s
   datacenter
                                           Cloud Service Provider
  MapReduce jobs can be done
   this way if bandwidth to the
   datacenter is adequate




                                     MapReduce Workers / Tahoe Clients


Hadoop World NYC 2009
                                                                         22
Deployment Scenario - Public Cloud Services
  For some users, everything                 Storage Servers
   could be run remotely in a
   service provider’s data center

  There are a few caveats and
   additional precautions in this
   scenario:

                                          Cloud Service Provider




                                    MapReduce Workers / Tahoe Clients


Hadoop World NYC 2009
                                                                        23
Public Cloud Deployment Considerations
  Store configuration files in memory

  Encrypt / disable swap

  Encrypt spillover
                                         Cloud Service Provider
  Must trust memory / hypervisor

  Trust service provider disks




Hadoop World NYC 2009
                                                                  24
HDFS and Linux Disk Encryption Drawbacks
  At most one key per node - no support for flexible access control

  Decryption done at the storage node rather than at the client - still have to trust storage nodes




Hadoop World NYC 2009
                                                                                                       25
Tahoe and HDFS - Comparison

          Feature             HDFS              Tahoe

       Confidentiality    File Permissions   AES Encryption

          Integrity        Checksum         Merkel Hash Tree

         Availability      Replication      Erasure Coding

      Expansion Factor         3x              3.3x (k/n)

        Self-Healing       Automatic           Automatic

      Load-balancing       Automatic            Planned

        Mutable Files          No                 Yes


Hadoop World NYC 2009
                                                               26
Performance           HDFS            Tahoe

  Tests run on ten nodes

  RandomWrite writes 1 GB per
                                                                   200
   node

  WordCount done over randomly                                    150
   generated text

  Tahoe write speed is 10x slower
                                                                   100
  Read-intensive jobs are about
   the same
                                                                   50
  Not so bad since the most
   common data use case is write-
   once, read-many                                                 0
                                     Random Write   Word Count

Hadoop World NYC 2009
                                                                          27
Code
  Tahoe available from http://allmydata.org
   – Licensed under GPL 2 or TGPPL

  Integration code available at http://hadoop-lafs.googlecode.com
   – Licensed under Apache 2




Hadoop World NYC 2009
                                                                     28

More Related Content

What's hot

Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMongoDB
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Cloudera, Inc.
 
Advanced Security In Hadoop Cluster
Advanced Security In Hadoop ClusterAdvanced Security In Hadoop Cluster
Advanced Security In Hadoop ClusterEdureka!
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011GlusterFS
 
Hadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment EvolutionHadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment EvolutionBenoit Perroud
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS StorageWebinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS StorageGlusterFS
 
Hadoop Distributed File System Reliability and Durability at Facebook
Hadoop Distributed File System Reliability and Durability at FacebookHadoop Distributed File System Reliability and Durability at Facebook
Hadoop Distributed File System Reliability and Durability at FacebookDataWorks Summit
 
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in ProductionUpgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in ProductionCloudera, Inc.
 
Compaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloCompaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloHortonworks
 
Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataWANdisco Plc
 
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Edureka!
 
Seamless replication and disaster recovery for Apache Hive Warehouse
Seamless replication and disaster recovery for Apache Hive WarehouseSeamless replication and disaster recovery for Apache Hive Warehouse
Seamless replication and disaster recovery for Apache Hive WarehouseDataWorks Summit
 
Introduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopIntroduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopCloudera, Inc.
 
EMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data StorageEMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data StorageEMC
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 

What's hot (20)

Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDB
 
dbaas-clone
dbaas-clonedbaas-clone
dbaas-clone
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
 
Advanced Security In Hadoop Cluster
Advanced Security In Hadoop ClusterAdvanced Security In Hadoop Cluster
Advanced Security In Hadoop Cluster
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011
 
Hadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment EvolutionHadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment Evolution
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS StorageWebinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
 
Hadoop Distributed File System Reliability and Durability at Facebook
Hadoop Distributed File System Reliability and Durability at FacebookHadoop Distributed File System Reliability and Durability at Facebook
Hadoop Distributed File System Reliability and Durability at Facebook
 
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in ProductionUpgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
 
Compaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloCompaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache Accumulo
 
Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big Data
 
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
 
Seamless replication and disaster recovery for Apache Hive Warehouse
Seamless replication and disaster recovery for Apache Hive WarehouseSeamless replication and disaster recovery for Apache Hive Warehouse
Seamless replication and disaster recovery for Apache Hive Warehouse
 
Introduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopIntroduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache Hadoop
 
EMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data StorageEMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data Storage
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 

Similar to Hw09 Map Reduce Over Tahoe A Least Authority Encrypted Distributed Filesystem

App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorialvinayiqbusiness
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo SlidesWebinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo SlidesCloudera, Inc.
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorialvinayiqbusiness
 
Hadoop Platforms - Introduction, Importance, Providers
Hadoop Platforms - Introduction, Importance, ProvidersHadoop Platforms - Introduction, Importance, Providers
Hadoop Platforms - Introduction, Importance, ProvidersMrigendra Sharma
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsshrey mehrotra
 
Hadoop - HDFS
Hadoop - HDFSHadoop - HDFS
Hadoop - HDFSKavyaGo
 
field_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentahofield_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentahoMartin Ferguson
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
Bigdata and hadoop
Bigdata and hadoopBigdata and hadoop
Bigdata and hadoopAditi Yadav
 
VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...
VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...
VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...VMworld
 
OPERATING SYSTEM .pptx
OPERATING SYSTEM .pptxOPERATING SYSTEM .pptx
OPERATING SYSTEM .pptxAltafKhadim
 

Similar to Hw09 Map Reduce Over Tahoe A Least Authority Encrypted Distributed Filesystem (20)

App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
Hadoop .pdf
Hadoop .pdfHadoop .pdf
Hadoop .pdf
 
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo SlidesWebinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
EMC config Hadoop
EMC config HadoopEMC config Hadoop
EMC config Hadoop
 
Hadoop Platforms - Introduction, Importance, Providers
Hadoop Platforms - Introduction, Importance, ProvidersHadoop Platforms - Introduction, Importance, Providers
Hadoop Platforms - Introduction, Importance, Providers
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Hadoop - HDFS
Hadoop - HDFSHadoop - HDFS
Hadoop - HDFS
 
field_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentahofield_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentaho
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Bigdata and hadoop
Bigdata and hadoopBigdata and hadoop
Bigdata and hadoop
 
VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...
VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...
VMworld 2013: Beyond Mission Critical: Virtualizing Big-Data, Hadoop, HPC, Cl...
 
Hadoop
HadoopHadoop
Hadoop
 
OPERATING SYSTEM .pptx
OPERATING SYSTEM .pptxOPERATING SYSTEM .pptx
OPERATING SYSTEM .pptx
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 

Recently uploaded (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

Hw09 Map Reduce Over Tahoe A Least Authority Encrypted Distributed Filesystem

  • 1. MapReduce over Tahoe Aaron Cordova Associate New York Oct 1, 2009 Booz Allen Hamilton Inc. . 134 National Business Parkway Annapolis Junction, MD 20701 cordova_aaron@bah.com Hadoop World 2009 2009 Hadoop World NYC 1
  • 2. MapReduce over Tahoe  Impact of data security requirements on large scale analysis  Introduction to Tahoe  Integrating Tahoe with Hadoop’s MapReduce  Deployment scenarios, considerations  Test results Hadoop World NYC 2009 2
  • 3. Features of Large Scale Analysis  As data grows, it becomes harder, more expensive to move – “Massive” data  The more data sets are located together, the more valuable each is – Network Effect  Bring computation to the data Hadoop World NYC 2009 3
  • 4. Data Security and Large Scale Analysis  Each department within an organization has its own data  Some data need to be shared  Others are protected CRM Product Sales Testing Hadoop World NYC 2009 4
  • 5. Data Security  Because of security constraints, departments tend to setup their own data storage and processing systems independently Support Support Support Support  This includes support staff Storage Storage Storage Storage  Highly inefficient Processing Processing Processing Processing  Analysis across datasets is impossible Apps Apps Apps Apps Hadoop World NYC 2009 5
  • 7. Tahoe - A Least Authority File System  Release 1.5  AllMyData.com  Included in Ubuntu Karmic Koala  Open Source Hadoop World NYC 2009 7
  • 8. Tahoe Architecture  Data originates at the client, which is trusted Storage Servers  Client encrypts, segments, and erasure-codes data  Segments are distributed to storage nodes over encrypted links  Storage nodes only see encrypted SSL data, and are not trusted Client Hadoop World NYC 2009 8
  • 9. Tahoe Architecture Features  AES Encryption  Segmentation  Erasure-coding  Distributed  Flexible Access Control Hadoop World NYC 2009 9
  • 10. Erasure Coding Overview N K  Only k of n segments are needed to recover the file  Up to n-k machines can fail, be compromised, or malicious without data loss  n and k are configurable, and can be chosen to achieve desired availability  Expansion factor of data is k/n (default is 3/10, or 3.3) Hadoop World NYC 2009 10
  • 11. Flexible Access Control  Each file has a Read Capability and a Write Capability  These are decryption keys ReadCap File  Directories have capabilities too WriteCap ReadCap Dir WriteCap Hadoop World NYC 2009 11
  • 12. Flexible Access Control  Access to a subset of files can be done by: – creating a directory Dir – attaching files – sharing read or write capabilities of the dir  Any files or directories attached are accessible  Any outside the directory are not File Dir ReadCap File File Hadoop World NYC 2009 12
  • 13. Access Control Example Files Directories /Sales /Testing Each department can access their own files Hadoop World NYC 2009 13
  • 14. Access Control Example Files Directories /Sales /Testing Each department can access their own files Hadoop World NYC 2009 14
  • 15. Access Control Example Files Directories /Sales /New /Testing Products Files that need to be shared can be linked to a new directory, whose read capability is given to both departments Hadoop World NYC 2009 15
  • 16. Hadoop Can Use The Following File Systems  HDFS  Cloud Store (KFS)  Amazon S3  FTP  Read only HTTP  Now, Tahoe! Hadoop World NYC 2009 16
  • 17. Hadoop File System Integration HowTo  Step 1. – Locate your favorite file system’s API  Step 2. – subclass FileSystem – found in /src/core/org/apache/hadoop/fs/FileSystem.java  Step 3. – Add lines to core-site.xml: <name> fs.lafs.impl </name> <value> your.class </value>  Step 4. – Test using your favorite Infrastructure Service Provider Hadoop World NYC 2009 17
  • 18. Hadoop Integration : MapReduce  One Tahoe client is run on each Storage Servers machine that serves as a MapReduce Worker  On average, clients communicate with k storage servers  Jobs are limited by aggregate network bandwidth  MapReduce workers are trusted, storage nodes are not Hadoop Map Reduce Workers Hadoop World NYC 2009 18
  • 19. Hadoop-Tahoe Configuration  Step 1. Start Tahoe  Step 2. Create a new directory in Tahoe, note the WriteCap  Step 3. Configure core-site.xml thus: – fs.lafs.impl: org.apache.hadoop.fs.lafs.LAFS – lafs.rootcap: $WRITE_CAP – fs.default.name: lafs://localhost  Step 4. Start MapReduce, but not HDFS Hadoop World NYC 2009 19
  • 20. Deployment Scenario - Large Organization  Within a datacenter, Storage Servers departments can run MapReduce jobs on discrete groups of compute nodes  Each MapReduce job accesses a directory containing a subset of files  Results are written back to the storage servers, encrypted Sales Audit MapReduce Workers / Tahoe Clients Hadoop World NYC 2009 20
  • 21. Deployment Scenario - Community  If a community uses a shared Storage Servers data center, different organizations can run discrete MapReduce jobs  Perhaps most importantly, when results are deemed appropriate to share, access can be granted simply by sending a read or write capability  Since the data are all co-located already, no data needs to be moved FBI Homeland Sec MapReduce Workers / Tahoe Clients Hadoop World NYC 2009 21
  • 22. Deployment Scenario - Public Cloud Services  Since storage nodes require no Storage Servers trust, they can be located at a remote location, e.g. within a cloud service provider’s datacenter Cloud Service Provider  MapReduce jobs can be done this way if bandwidth to the datacenter is adequate MapReduce Workers / Tahoe Clients Hadoop World NYC 2009 22
  • 23. Deployment Scenario - Public Cloud Services  For some users, everything Storage Servers could be run remotely in a service provider’s data center  There are a few caveats and additional precautions in this scenario: Cloud Service Provider MapReduce Workers / Tahoe Clients Hadoop World NYC 2009 23
  • 24. Public Cloud Deployment Considerations  Store configuration files in memory  Encrypt / disable swap  Encrypt spillover Cloud Service Provider  Must trust memory / hypervisor  Trust service provider disks Hadoop World NYC 2009 24
  • 25. HDFS and Linux Disk Encryption Drawbacks  At most one key per node - no support for flexible access control  Decryption done at the storage node rather than at the client - still have to trust storage nodes Hadoop World NYC 2009 25
  • 26. Tahoe and HDFS - Comparison Feature HDFS Tahoe Confidentiality File Permissions AES Encryption Integrity Checksum Merkel Hash Tree Availability Replication Erasure Coding Expansion Factor 3x 3.3x (k/n) Self-Healing Automatic Automatic Load-balancing Automatic Planned Mutable Files No Yes Hadoop World NYC 2009 26
  • 27. Performance HDFS Tahoe  Tests run on ten nodes  RandomWrite writes 1 GB per 200 node  WordCount done over randomly 150 generated text  Tahoe write speed is 10x slower 100  Read-intensive jobs are about the same 50  Not so bad since the most common data use case is write- once, read-many 0 Random Write Word Count Hadoop World NYC 2009 27
  • 28. Code  Tahoe available from http://allmydata.org – Licensed under GPL 2 or TGPPL  Integration code available at http://hadoop-lafs.googlecode.com – Licensed under Apache 2 Hadoop World NYC 2009 28