Hadoop as a Service


Jun Ping Du
Richard McDougall
VMware, Inc.




                      © 2009 VMware Inc. All rights reserved
Cloud: Big Shifts in Simplification and Optimization


1. Reduce the complexity, to simplify operations and maintenance
2. Dramatically lower costs, to redirect investment into value-add opportunities
3. Enable flexible, agile IT service delivery, to meet and anticipate the needs of the business




 2
Infrastructure, Apps and now Data…




[Diagram: Build, Run, Manage across Private and Public clouds]

Simplify Infrastructure with Cloud
Simplify App Platform through PaaS
Next Trend: Simplify Data



 3
Trend 1/3: New Data Growing at 60% Y/Y

Exabytes of information stored
 • 20 Zetta by 2015
 • 1 Yotta by 2030
 • Yes, you are part of the yotta generation…

[Chart: exabytes of information stored over time; labeled data sources: audio, digital TV, digital photos, camera phones, RFID, medical imaging, sensors, satellite images, games, scanners, Twitter, CAD/CAM, appliances, videoconferencing, digital movies]

Source: The Information Explosion, 2009


4
Trend 2/3: Big Data – Driven by Real-World Benefit




5
Trend 3/3: Value from Data Exceeds Hardware Cost

 Value from the intelligence of data analytics now outstrips the cost of hardware
    • Hadoop enables the use of lower-cost hardware
    • Hardware cost halving every 18 months

 [Chart: Value vs. Cost, from Big Iron at $40k/CPU to Commodity Cluster at $1k/CPU]




6
Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware

 Trend is “not just Hadoop” for big data
  • Hadoop is often combined with other technologies: Big SQL, NoSQL, etc.
  • Unify the infrastructure platform for all

 [Diagram: separate SQL, NoSQL, Hadoop, and DSS clusters alongside a Unified Big Data Infrastructure (Big SQL, NoSQL, Hadoop) spanning Private and Public]

 Common Hardware Base
  • Eliminate the hardware/driver/testing phase
  • Use existing team for ordering, diagnosis, capacity management of hardware farm
 7
Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning

I WANT MY HADOOP CLUSTER NOW!

                                 Instant Cluster Provisioning
                                  • Provision Hadoop clusters instantly
                                  • Automatable using provisioning engines/scripts, e.g. Whirr




  8
Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities

 Increase Utilization
    • Hadoop cluster only uses resources it needs
    • Extra resources can be used by other applications when not in use
 Eliminate single points of failure
    • Use vSphere HA for Namenode and Jobtracker
 Use VM Isolation
    • Create separate clusters with defensible security
    • Enables multiple versions of Hadoop on the same infrastructure
    • Extends to Hadoop and Linux Environments
 Leverage Resource Management
    • Control/assign resources through resource pools
    • E.g. Use spare cycles for Hadoop Processing through priority control



9
What? Hadoop in a VM? Really?




        Actually, Hadoop performs well in a virtual machine




10
Performance Test: Cluster Configuration



                Mellanox 10 GbE switch



     AMAX ClusterMax
     2X X5650, 96 GB
     12X SATA 500 GB
     Mellanox 10 GbE adapter




11
Cluster Configuration
 Hardware
 • AMAX ClusterMax, 7 nodes
 • 2X X5650 2.67 GHz hex-core, 96 GB memory
 • 12X SATA 500 GB 7200 RPM (10 for Hadoop data), EXT4
 • Mellanox ConnectX VPI (MT26418), 10 GbE
 • Mellanox Vantage 6048, 10 GbE
 OS/Hypervisor
 • RHEL 6.1 x86_64 (native and guest)
 • ESX 5.0 RTM with devel Mellanox driver
 VMs (HT off/on)
 • 1 VM: 92000 MB, (12/24) vCPUs, 10 PRDM disks
 • 2 VMs: 46000 MB, (6/12) vCPUs, 5 PRDM disks
 • 4 VMs (HT on only):
     • 2 small: 18400 MB, 5 vCPUs, 2 disks
     • 2 large: 27600 MB, 7 vCPUs, 3 disks
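
The VM layouts above can be cross-checked with a little arithmetic: with hyper-threading on, each layout should add up to the same per-host memory, vCPU, and disk totals. The following is a minimal sketch using only the figures listed on this slide; the constant and function names are illustrative, not part of the original deck.

```python
# VM configurations from the slide (HT on): (memory in MB, vCPUs, data disks).
ONE_VM = (92000, 24, 10)
TWO_VMS = [(46000, 12, 5)] * 2
FOUR_VMS = [(18400, 5, 2)] * 2 + [(27600, 7, 3)] * 2


def totals(vms):
    """Sum memory, vCPUs, and disks across all VMs on one host."""
    return tuple(sum(column) for column in zip(*vms))


if __name__ == "__main__":
    for label, vms in [("1 VM", [ONE_VM]), ("2 VMs", TWO_VMS), ("4 VMs", FOUR_VMS)]:
        print(label, totals(vms))
```

All three layouts come to 92000 MB, 24 vCPUs, and 10 disks per host, so the comparison between them holds the host resources constant.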
12
Hadoop Configuration
Distribution
  • Cloudera CDH3u0
  • Based on Apache open-source 0.20.2
Parameters
 • dfs.datanode.max.xcievers=4096
 • dfs.replication=2
 • dfs.block.size=134217728
 • io.file.buffer.size=131072
 • mapred.child.java.opts="-Xmx2048m -Xmn512m" (native)
 • mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual)
 Network topology
  • Hadoop uses info for reliability and performance
  • Multiple VMs/host: Each host is a “rack”
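
As an illustration, the parameters above could be rendered into Hadoop-style `<property>` entries. This is only a sketch: the split across `hdfs-site.xml`, `core-site.xml`, and `mapred-site.xml` is the conventional one for Hadoop 0.20-era releases and is assumed here rather than stated on the slide, and `render_site_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Values copied from the slide; the native JVM options are shown here.
HDFS_SITE = {
    "dfs.datanode.max.xcievers": "4096",
    "dfs.replication": "2",
    "dfs.block.size": "134217728",  # 128 MiB
}
CORE_SITE = {"io.file.buffer.size": "131072"}  # 128 KiB
MAPRED_SITE = {"mapred.child.java.opts": "-Xmx2048m -Xmn512m"}


def render_site_xml(properties):
    """Return a Hadoop-style <configuration> document as a string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for filename, props in [
        ("hdfs-site.xml", HDFS_SITE),
        ("core-site.xml", CORE_SITE),
        ("mapred-site.xml", MAPRED_SITE),
    ]:
        print(filename)
        print(render_site_xml(props))
```

For the virtual runs, the only change would be the smaller heap in `mapred.child.java.opts` (`-Xmx1900m`).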


13
Benchmarks
 Derived from test apps included in distro
 Pi
 • Direct-exec Monte-Carlo estimation of pi
 • # map tasks = # logical processors
 • 1.68 T samples
 • (figure: pi ~ 4*R/(R+G) = 22/7)
 TestDFSIO
 • Streaming write and read
 • 1 TB
 • More tasks than processors
 Terasort
 • 3 phases: teragen, terasort, teravalidate
 • 10B or 35B records, each 100 Bytes (1 TB, 3.5 TB)
 • More tasks than processors
 • CPU, networking, and storage I/O
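
To make the Pi benchmark concrete, here is a minimal single-process sketch of a Monte-Carlo estimate: sample random points in the unit square and count the fraction that land inside the quarter circle. It uses plain pseudo-random sampling, which is not necessarily the sampling scheme the Hadoop example job uses, and `estimate_pi` is an illustrative name.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi from the fraction of random points inside the unit quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Quarter-circle area / unit-square area = pi / 4.
    return 4.0 * inside / samples


if __name__ == "__main__":
    print(estimate_pi(1_000_000))
```

In the benchmark the samples are split across map tasks, one per logical processor, which is why the workload is almost purely CPU-bound.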

14
Performance of Hadoop for Several Workloads

[Chart: Ratio of time taken, lower is better; y-axis "Ratio to Native" from 0 to 1.2; series: 1 VM, 2 VMs]

15
Architecting Hadoop as a Service using Virtualization

 Goals
 • Make it fast and easy to provision new Hadoop Clusters on Demand
 • Leverage virtual machines to provide isolation (esp. for Multi-tenant)
 • Optimize Hadoop’s performance based on virtual topologies
 • Make the system reliable based on virtual topologies
 Leveraging Virtualization
 • Elastic scale in/out
 • Use high-availability to protect namenode/job tracker
 • Resource controls and sharing: re-use underutilized memory, cpu
 • Prioritize Workloads: limit or guarantee resource usage in a mixed
     environment




16
Provisioning

 Leverage the vSphere APIs to auto-deploy a cluster
 • Whirr, HOD, or custom tooling using Ruby, Chef, etc.
 Use linked-clones to rapidly fork many nodes




17
Fast Provisioning

 From a “seed” node to a cluster




     Thin Provisioning: 60 GB => 3.5 GB
     Linked Clone: ~6 seconds

18
SAN, NAS or Local Disk?

  Shared Storage: SAN or NAS
         • Easy to provision
         • Automated cluster rebalancing
  Hybrid Storage
         • SAN for boot images, VMs, other workloads
         • Local disk for HDFS
         • Scalable bandwidth, lower cost/GB

  [Diagram: six hosts, each running Hadoop VMs alongside other VMs]




     19
Enable Automatic Rack awareness through vSphere

 Rack awareness is important for a robust Hadoop cluster

 Automatic network topology detection is an important vSphere feature

 The rack script is generated automatically
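
For illustration, a rack script of the kind Hadoop expects could look like the sketch below. Hadoop invokes the configured topology script with one or more node addresses as arguments and reads one rack path per argument from standard output. The address-to-host table is entirely hypothetical; in the setup described here it would be generated from vSphere inventory data, and each physical host is treated as a "rack".

```python
import sys

# Hypothetical mapping from VM address to the physical host it runs on.
# In practice this table would be generated, not written by hand.
VM_TO_HOST = {
    "10.0.0.11": "esx-host-1",
    "10.0.0.12": "esx-host-1",
    "10.0.0.21": "esx-host-2",
    "10.0.0.22": "esx-host-2",
}
DEFAULT_RACK = "/default-rack"


def rack_for(address):
    """Treat each physical host as a rack; unknown addresses get a default rack."""
    host = VM_TO_HOST.get(address)
    return f"/{host}" if host else DEFAULT_RACK


def main(argv):
    # One rack path is printed per address passed on the command line.
    print(" ".join(rack_for(arg) for arg in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

With this mapping, two VMs on the same physical host resolve to the same rack path, so HDFS will not count them as independent failure domains.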




20
Multi-tenant: share cluster or not

      Shared big cluster vs. isolated small clusters

      Shared big cluster
        • High performance
        • Large scale
        • Pre-job provisioning
      Isolated small clusters
        • Secure
        • Flexible
        • Post-job provisioning

Combination, as customers’ requirements differ

21
Elastic Hadoop Cluster

 Traditional hadoop cluster
     • Easy to scale out
       • Fast-provision new hadoop nodes and join into existing cluster
     • Hard to scale in
 while (clusterIsTooLarge) {
      choose node k;
      kill(node k);
      wait until the data blocks held by k are recovered;
      if necessary, hadoop.rebalance();
 }
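
The scale-in loop above can be simulated on a toy model to show why it is slow: every removal forces the removed node's blocks to be copied elsewhere before the next node can go. This is a sketch under simplifying assumptions (an in-memory block map, instant copies, no rebalance step); `scale_in` and its parameters are illustrative names, not a Hadoop API.

```python
import random


def scale_in(nodes, blocks, target_size, replication=2, seed=0):
    """Remove nodes one at a time, re-replicating their blocks before the next removal.

    nodes: set of node names; blocks: dict mapping block id -> set of nodes holding it.
    """
    rng = random.Random(seed)
    nodes = set(nodes)
    blocks = {b: set(holders) for b, holders in blocks.items()}
    while len(nodes) > max(target_size, replication):
        victim = rng.choice(sorted(nodes))  # choose node k
        nodes.remove(victim)                # kill(node k)
        for holders in blocks.values():     # recover k's blocks before continuing
            holders.discard(victim)
            while len(holders) < replication:
                candidates = sorted(nodes - holders)
                holders.add(rng.choice(candidates))
    return nodes, blocks


if __name__ == "__main__":
    cluster = {f"node{i}" for i in range(6)}
    placement = {b: set(random.Random(b).sample(sorted(cluster), 2)) for b in range(20)}
    remaining, placed = scale_in(cluster, placement, target_size=3)
    print(sorted(remaining))
    print(all(len(h) == 2 for h in placed.values()))
```

In a real cluster the recovery step moves actual data over the network, so the cost of each iteration grows with the amount of data stored on the node being removed.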

 Elastic hadoop cluster
 [Diagram: NN and JT with a row of normal nodes and a row of elastic nodes; legend: Normal node, Elastic node, TaskTracker, DataNode]

22
Replica Placement

 Second Replica
 • Different rack
 • Rack-awareness required


 Third Replica
 • Same rack, different physical host
 • Nodes share host (in virtualized
     environment)
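
The placement rule can be sketched as a small selection function over nodes tagged with their physical host and rack. This is an illustration only: it assumes the first replica stays on the writer's node and that "same rack" means the rack of the second replica, as in the default HDFS policy. `Node` and `place_replicas` are hypothetical names, not Hadoop classes.

```python
import random
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    host: str  # physical host the VM runs on
    rack: str


def place_replicas(writer, nodes, seed=0):
    """Pick three replica locations: local, off-rack, then same rack as the
    second replica but on a different physical host."""
    rng = random.Random(seed)
    first = writer
    off_rack = [n for n in nodes if n.rack != first.rack]
    second = rng.choice(off_rack)
    same_rack_other_host = [
        n for n in nodes if n.rack == second.rack and n.host != second.host
    ]
    third = rng.choice(same_rack_other_host)
    return first, second, third


if __name__ == "__main__":
    cluster = [
        Node(f"vm-{r}{h}{v}", f"host-{r}{h}", f"rack-{r}")
        for r in "ab" for h in "12" for v in "xy"
    ]
    for replica in place_replicas(cluster[0], cluster):
        print(replica)
```

Filtering on `host` rather than on node name is the virtualization-specific part: two VMs on one physical host fail together, so they must not hold two replicas of the same block.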




23
Demo




24
Performance

 Create more, smaller VMs
 • Makes Hadoop scale better
 • Allows for easier/faster adjustment of packing of VMs across hosts by vSphere
     (including through DRS)
 Sizing/Configuration of storage is critical
 • Plan on ~50 MB/s of bandwidth per core
 • SANs are typically configured by default for IOPS, not Bandwidth
 • Ensure SAN ports/switch topology allows required aggregate bandwidth
 • Performance of the backend storage should be tested/sized
 • Local disks will give ~100-140 MB/s per disk: pick correct controller
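
As a worked example of the sizing guidance, the sketch below turns the per-core and per-disk planning figures into a disk count for one host. The 12-core host matches the test cluster described earlier; the function names are illustrative, and real sizing should be confirmed by testing the backend storage.

```python
import math

BANDWIDTH_PER_CORE_MB_S = 50      # planning figure from the slide
DISK_BANDWIDTH_MB_S = (100, 140)  # per-disk range from the slide


def required_bandwidth_mb_s(cores):
    """Aggregate storage bandwidth a host should be able to sustain."""
    return cores * BANDWIDTH_PER_CORE_MB_S


def disks_needed(cores):
    """Return (fewest, most) local disks needed across the per-disk bandwidth range."""
    need = required_bandwidth_mb_s(cores)
    slow, fast = DISK_BANDWIDTH_MB_S
    return math.ceil(need / fast), math.ceil(need / slow)


if __name__ == "__main__":
    cores = 12  # two hex-core X5650 sockets, as in the test cluster
    print(required_bandwidth_mb_s(cores), disks_needed(cores))  # 600 (5, 6)
```

A 12-core host therefore needs about 600 MB/s, or five to six local disks, which is consistent with the ten Hadoop data disks per node used in the benchmark configuration.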




25
Summary

 Hadoop does work well in a virtual environment
 Plan a virtual cluster, enable other big-data solutions on the same
 infrastructure
 Leverage the recipes to automate your configuration and
 deployment




26

Contenu connexe

Tendances

Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudPart 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudCloudera, Inc.
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessCloudera, Inc.
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduCloudera, Inc.
 
Enterprise-Database-Migration-Strategies-and-Options-on-AWS
Enterprise-Database-Migration-Strategies-and-Options-on-AWSEnterprise-Database-Migration-Strategies-and-Options-on-AWS
Enterprise-Database-Migration-Strategies-and-Options-on-AWSAmazon Web Services
 
Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...
Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...
Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...DataWorks Summit
 
Use Hybrid Cloud to Streamline SAP with NetApp, AWS and SAP LVM
Use Hybrid Cloud to Streamline SAP with NetApp, AWS and SAP LVMUse Hybrid Cloud to Streamline SAP with NetApp, AWS and SAP LVM
Use Hybrid Cloud to Streamline SAP with NetApp, AWS and SAP LVMAmazon Web Services
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Cloudera, Inc.
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
32984 cloud system la-bcs
32984 cloud system la-bcs32984 cloud system la-bcs
32984 cloud system la-bcsgmazuel
 
HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
HTAP By Accident: Getting More From PostgreSQL Using Hardware AccelerationHTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
HTAP By Accident: Getting More From PostgreSQL Using Hardware AccelerationEDB
 
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS MeetupChallenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS MeetupAndrei Savu
 
Use the power of Microsoft Azure with NetApp Storage
Use the power of Microsoft Azure with NetApp StorageUse the power of Microsoft Azure with NetApp Storage
Use the power of Microsoft Azure with NetApp StorageProact Netherlands B.V.
 
Road to Cloudera certification
Road to Cloudera certificationRoad to Cloudera certification
Road to Cloudera certificationCloudera, Inc.
 
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac... Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...Cloudera, Inc.
 
DEVNET-1166 Open SDN Controller APIs
DEVNET-1166	Open SDN Controller APIsDEVNET-1166	Open SDN Controller APIs
DEVNET-1166 Open SDN Controller APIsCisco DevNet
 
Achieving cloud scale with microservices based applications on azure
Achieving cloud scale with microservices based applications on azureAchieving cloud scale with microservices based applications on azure
Achieving cloud scale with microservices based applications on azureUtkarsh Pandey
 
Best Practices for Monitoring Postgres
Best Practices for Monitoring Postgres Best Practices for Monitoring Postgres
Best Practices for Monitoring Postgres EDB
 

Tendances (20)

Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudPart 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data Success
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache Kudu
 
EMC config Hadoop
EMC config HadoopEMC config Hadoop
EMC config Hadoop
 
Enterprise-Database-Migration-Strategies-and-Options-on-AWS
Enterprise-Database-Migration-Strategies-and-Options-on-AWSEnterprise-Database-Migration-Strategies-and-Options-on-AWS
Enterprise-Database-Migration-Strategies-and-Options-on-AWS
 
Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...
Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...
Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...
 
Use Hybrid Cloud to Streamline SAP with NetApp, AWS and SAP LVM
Use Hybrid Cloud to Streamline SAP with NetApp, AWS and SAP LVMUse Hybrid Cloud to Streamline SAP with NetApp, AWS and SAP LVM
Use Hybrid Cloud to Streamline SAP with NetApp, AWS and SAP LVM
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
32984 cloud system la-bcs
32984 cloud system la-bcs32984 cloud system la-bcs
32984 cloud system la-bcs
 
HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
HTAP By Accident: Getting More From PostgreSQL Using Hardware AccelerationHTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
 
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS MeetupChallenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
 
Hybrid is the New Normal
Hybrid is the New NormalHybrid is the New Normal
Hybrid is the New Normal
 
Use the power of Microsoft Azure with NetApp Storage
Use the power of Microsoft Azure with NetApp StorageUse the power of Microsoft Azure with NetApp Storage
Use the power of Microsoft Azure with NetApp Storage
 
Road to Cloudera certification
Road to Cloudera certificationRoad to Cloudera certification
Road to Cloudera certification
 
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac... Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 
SAP on AWS
SAP on AWSSAP on AWS
SAP on AWS
 
DEVNET-1166 Open SDN Controller APIs
DEVNET-1166	Open SDN Controller APIsDEVNET-1166	Open SDN Controller APIs
DEVNET-1166 Open SDN Controller APIs
 
Achieving cloud scale with microservices based applications on azure
Achieving cloud scale with microservices based applications on azureAchieving cloud scale with microservices based applications on azure
Achieving cloud scale with microservices based applications on azure
 
Best Practices for Monitoring Postgres
Best Practices for Monitoring Postgres Best Practices for Monitoring Postgres
Best Practices for Monitoring Postgres
 

Similaire à Hadoop World 2011: Hadoop as a Service in Cloud

Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Richard McDougall
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Best Practices for Virtualizing Hadoop
Best Practices for Virtualizing HadoopBest Practices for Virtualizing Hadoop
Best Practices for Virtualizing HadoopDataWorks Summit
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataRichard McDougall
 
Big data on virtualized infrastucture
Big data on virtualized infrastuctureBig data on virtualized infrastucture
Big data on virtualized infrastuctureDataWorks Summit
 
Introduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network IssuesIntroduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network IssuesJason TC HOU (侯宗成)
 
Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Steve Staso
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephantsOvidiu Dimulescu
 
End of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph ReplicationEnd of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph ReplicationCeph Community
 
1. beyond mission critical virtualizing big data and hadoop
1. beyond mission critical   virtualizing big data and hadoop1. beyond mission critical   virtualizing big data and hadoop
1. beyond mission critical virtualizing big data and hadoopChiou-Nan Chen
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big datasolarisyourep
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big dataxKinAnx
 
Apache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesApache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesDataWorks Summit
 
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld
 
Architecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationArchitecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationVlad Ponomarev
 
Infinitely Scalable Clusters - Grid Computing on Public Cloud - London
Infinitely Scalable Clusters - Grid Computing on Public Cloud - LondonInfinitely Scalable Clusters - Grid Computing on Public Cloud - London
Infinitely Scalable Clusters - Grid Computing on Public Cloud - LondonHentsū
 

Similaire à Hadoop World 2011: Hadoop as a Service in Cloud (20)

Hadoop on VMware
Hadoop on VMwareHadoop on VMware
Hadoop on VMware
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Best Practices for Virtualizing Hadoop
Best Practices for Virtualizing HadoopBest Practices for Virtualizing Hadoop
Best Practices for Virtualizing Hadoop
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big Data
 
Big data on virtualized infrastucture
Big data on virtualized infrastuctureBig data on virtualized infrastucture
Big data on virtualized infrastucture
 
Introduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network IssuesIntroduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network Issues
 
Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
End of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph ReplicationEnd of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph Replication
 
1. beyond mission critical virtualizing big data and hadoop
1. beyond mission critical   virtualizing big data and hadoop1. beyond mission critical   virtualizing big data and hadoop
1. beyond mission critical virtualizing big data and hadoop
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 
Apache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesApache Hadoop on Virtual Machines
Apache Hadoop on Virtual Machines
 
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
 
Architecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationArchitecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentation
 
Infinitely Scalable Clusters - Grid Computing on Public Cloud - London
Infinitely Scalable Clusters - Grid Computing on Public Cloud - LondonInfinitely Scalable Clusters - Grid Computing on Public Cloud - London
Infinitely Scalable Clusters - Grid Computing on Public Cloud - London
 

Plus de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19

Hadoop World 2011: Hadoop as a Service in Cloud

  • 1. Hadoop as a Service Jun Ping Du Richard McDougall VMware, Inc. © 2009 VMware Inc. All rights reserved
  • 2. Cloud: Big Shifts in Simplification and Optimization
      1. Reduce the Complexity to simplify operations and maintenance
      2. Dramatically Lower Costs to redirect investment into value-add opportunities
      3. Enable Flexible, Agile IT Service Delivery to meet and anticipate the needs of the business
  • 3. Infrastructure, Apps and now Data…
      • Simplify Infrastructure With Cloud
      • Simplify App Platform Through PaaS
      • Next Trend: Simplify Data
      Diagram labels: Build, Run, Manage; Private, Public
  • 4. Trend 1/3: New Data Growing at 60% Y/Y
      Exabytes of information stored: 20 Zetta by 2015, 1 Yotta by 2030
      Yes, you are part of the yotta generation…
      Chart labels: audio; digital tv; digital photos; camera phones, rfid; medical imaging, sensors; satellite images, games, scanners, twitter; cad/cam, appliances, videoconferencing, digital movies
      Source: The Information Explosion, 2009
  • 5. Trend 2/3: Big Data – Driven by Real-World Benefit
  • 6. Trend 3/3: Value from Data Exceeds Hardware Cost
      Value from the intelligence of data analytics now outstrips the cost of hardware
        • Hadoop enables the use of lower cost hardware
        • Hardware cost halving every 18 months
      Chart (Value vs. Cost): Big Iron: $40k/CPU; Commodity Cluster: $1k/CPU
  • 7. Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware
      Trend is "not just Hadoop" for big data
        • Hadoop is often combined with other technologies: Big SQL, NoSQL, etc.
        • Unify the infrastructure platform for all
      Common Hardware Base
        • Eliminate the hardware/driver/testing phase
        • Use existing team for ordering, diagnosis, capacity management of hardware farm
      Diagram labels: SQL Cluster, NoSQL Cluster, Hadoop Cluster, DSS Cluster; Unified Big Data Infrastructure (Big SQL, NoSQL, Hadoop); Private, Public
  • 8. Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning
      "I WANT MY HADOOP CLUSTER NOW!"
      Instant Cluster Provisioning
        • Provision Hadoop clusters instantly
        • Automatable using provisioning engines/scripts, e.g. Whirr
  • 9. Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities
      Increase Utilization
        • Hadoop cluster only uses resources it needs
        • Extra resources can be used by other applications when not in use
      Eliminate single points of failure
        • Use vSphere HA for NameNode and JobTracker
      Use VM Isolation
        • Create separate clusters with defensible security
        • Enables multiple versions of Hadoop on the same infrastructure
        • Extends to Hadoop and Linux environments
      Leverage Resource Management
        • Control/assign resources through resource pools
        • E.g. use spare cycles for Hadoop processing through priority control
  • 10. What? Hadoop in a VM? Really? Actually, Hadoop performs well in a virtual machine
  • 11. Performance Test: Cluster Configuration
      Diagram labels: Mellanox 10 GbE switch; AMAX ClusterMax; 2X X5650, 96 GB; 12X SATA 500 GB; Mellanox 10 GbE adapter
  • 12. Cluster Configuration
      Hardware
        • AMAX ClusterMax, 7 nodes
        • 2X X5650 2.67 GHz hex-core, 96 GB memory
        • 12X SATA 500 GB 7200 RPM (10 for Hadoop data), EXT4
        • Mellanox ConnectX VPI (MT26418), 10 GbE
        • Mellanox Vantage 6048, 10 GbE
      OS/Hypervisor
        • RHEL 6.1 x86_64 (native and guest)
        • ESX 5.0 RTM with devel Mellanox driver
      VMs (HT off/on)
        • 1 VM: 92000 MB, (12/24) vCPUs, 10 PRDM disks
        • 2 VMs: 46000 MB, (6/12) vCPUs, 5 PRDM disks
        • 4 VMs (HT on only):
            • 2 small: 18400 MB, 5 vCPUs, 2 disks
            • 2 large: 27600 MB, 7 vCPUs, 3 disks
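The three VM layouts on slide 12 appear to divide the same per-host resources in different ways. The following Python sketch simply adds up the figures from the slide (using the hyper-threading-on vCPU counts) to check that; the names `LAYOUTS` and `host_totals` are illustrative and not from the talk.

```python
# Per-host VM layouts as listed on slide 12, with hyper-threading on.
# Each entry is (number of VMs, memory in MB, vCPUs, disks) per VM size.
LAYOUTS = {
    "1 VM": [(1, 92000, 24, 10)],
    "2 VMs": [(2, 46000, 12, 5)],
    "4 VMs": [(2, 18400, 5, 2), (2, 27600, 7, 3)],
}


def host_totals(layout):
    """Sum memory (MB), vCPUs and disks over all VMs placed on one host."""
    memory = sum(count * mem for count, mem, _, _ in layout)
    vcpus = sum(count * cpus for count, _, cpus, _ in layout)
    disks = sum(count * d for count, _, _, d in layout)
    return memory, vcpus, disks


for name, layout in LAYOUTS.items():
    print(name, host_totals(layout))
```

Each layout adds up to 92000 MB, 24 vCPUs and 10 disks per host, which suggests the comparison between layouts holds total guest resources constant.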
  • 13. Hadoop Configuration
      Distribution
        • Cloudera CDH3u0
        • Based on Apache open-source 0.20.2
      Parameters
        • dfs.datanode.max.xcievers=4096
        • dfs.replication=2
        • dfs.block.size=134217728
        • io.file.buffer.size=131072
        • mapred.child.java.opts="-Xmx2048m -Xmn512m" (native)
        • mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual)
      Network topology
        • Hadoop uses info for reliability and performance
        • Multiple VMs/host: Each host is a "rack"
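As a rough illustration of how the slide 13 parameters could be written out, the Python sketch below renders them in Hadoop's `<configuration>`/`<property>` XML format. The property names and values are taken from the slide; the grouping into `hdfs-site.xml`, `core-site.xml` and `mapred-site.xml` follows the usual Hadoop 0.20 convention and is an assumption, as are the names `SITE_FILES` and `render_site_xml`. Only the virtual-machine heap setting is shown.

```python
import xml.etree.ElementTree as ET

# Values copied from slide 13. The split across files is an assumption
# based on the usual Hadoop 0.20 layout, not something the slide states.
SITE_FILES = {
    "hdfs-site.xml": {
        "dfs.datanode.max.xcievers": "4096",
        "dfs.replication": "2",
        "dfs.block.size": "134217728",  # 128 MiB
    },
    "core-site.xml": {
        "io.file.buffer.size": "131072",  # 128 KiB
    },
    "mapred-site.xml": {
        "mapred.child.java.opts": "-Xmx1900m -Xmn512m",  # virtual setting
    },
}


def render_site_xml(props):
    """Render a dict of properties as a Hadoop <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


for filename, props in SITE_FILES.items():
    print(filename)
    print(render_site_xml(props))
```

The smaller `-Xmx1900m` heap in the virtual case presumably leaves headroom inside the 92000 MB guest; the slide does not give the reason.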
  • 14. Benchmarks
      Derived from test apps included in distro
      Pi
        • Direct-exec Monte-Carlo estimation of pi
        • pi ~ 4*R/(R+G) = 22/7
        • # map tasks = # logical processors
        • 1.68 T samples
      TestDFSIO
        • Streaming write and read
        • 1 TB
        • More tasks than processors
      Terasort
        • 3 phases: teragen, terasort, teravalidate
        • 10B or 35B records, each 100 Bytes (1 TB, 3.5 TB)
        • More tasks than processors
        • CPU, networking, and storage I/O
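The Pi benchmark's formula, 4*R/(R+G), can be illustrated with a single-process Monte Carlo estimate. This Python sketch assumes R counts random points that fall inside the unit quarter-circle and G counts those that fall outside; that reading of the slide's symbols is an assumption. It is not the benchmark itself, which spreads the sampling across map tasks.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi as 4*R/(R+G) from random points in the unit square."""
    rng = random.Random(seed)
    inside = 0  # R: points with x^2 + y^2 <= 1 (assumed meaning of R)
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    outside = samples - inside  # G: the remaining points (assumed meaning of G)
    return 4.0 * inside / (inside + outside)


print(estimate_pi(100_000))
```

The error of such an estimate shrinks only with the square root of the sample count, which is why the benchmark uses on the order of 10^12 samples and is almost purely CPU-bound.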
  • 15. Performance of Hadoop for Several Workloads
      Chart: ratio of time taken (ratio to native), lower is better; series: 1 VM, 2 VMs
  • 16. Architecting Hadoop as a Service using Virtualization
      Goals
        • Make it fast and easy to provision new Hadoop clusters on demand
        • Leverage virtual machines to provide isolation (esp. for multi-tenant)
        • Optimize Hadoop's performance based on virtual topologies
        • Make the system reliable based on virtual topologies
      Leveraging Virtualization
        • Elastic scale in/out
        • Use high-availability to protect NameNode/JobTracker
        • Resource controls and sharing: re-use underutilized memory, CPU
        • Prioritize workloads: limit or guarantee resource usage in a mixed environment
  • 17. Provisioning
      Leverage the vSphere APIs to auto-deploy a cluster
        • Whirr, HOD, or custom using Ruby, Chef, etc.
      Use linked clones to rapidly fork many nodes
  • 18. Fast Provisioning
      From a "seed" node to a cluster
        • Thin Provisioning: 60GB => 3.5GB
        • Linked Clone: ~6 seconds
  • 19. SAN, NAS or Local Disk?
      Shared Storage: SAN or NAS
        • Easy to provision
        • Automated cluster rebalancing
      Hybrid Storage
        • SAN for boot images, VMs, other workloads
        • Local disk for HDFS
        • Scalable bandwidth, lower cost/GB
      Diagram labels: Other VM, Hadoop, Host
  • 20. Enable Automatic Rack Awareness through vSphere
      • Important for a robust Hadoop cluster
      • Automatic network topology detection: an important vSphere feature
      • Rack script is generated automatically
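Slide 20 does not show the generated rack script. The following is a minimal Python sketch of what such a script could look like, assuming Hadoop's topology-script convention: the script receives node addresses as arguments and prints one rack path per address. In Hadoop 0.20 the script is registered through the `topology.script.file.name` property. The `VM_TO_HOST` inventory and its addresses are hypothetical; a real deployment would presumably generate this mapping from vSphere. Following slide 13, each physical host is treated as a "rack".

```python
#!/usr/bin/env python3
import sys

# Hypothetical VM-to-physical-host inventory. In the setup the slides
# describe this would be generated automatically; the addresses and
# host names here are made up for illustration.
VM_TO_HOST = {
    "10.0.0.11": "esx-host-1",
    "10.0.0.12": "esx-host-1",
    "10.0.0.21": "esx-host-2",
    "10.0.0.22": "esx-host-2",
}

DEFAULT_RACK = "/default-rack"


def resolve(addresses):
    """Map each address to a rack path, treating each physical host as a rack."""
    return [
        "/" + VM_TO_HOST[addr] if addr in VM_TO_HOST else DEFAULT_RACK
        for addr in addresses
    ]


if __name__ == "__main__":
    # Hadoop passes one or more addresses and reads back one rack per address.
    for rack in resolve(sys.argv[1:]):
        print(rack)
```

With this mapping, two VMs on the same physical host report the same rack, so HDFS would avoid placing all replicas of a block on VMs that fail together.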
  • 21. Multi-tenant: Share Cluster or Not
      Shared big cluster vs. isolated small clusters
        • Shared big cluster: high performance, large scale, pre-job provisioning
        • Isolated small clusters: secure, flexible, post-job provisioning
      Combination, as customers' requirements differ
  • 22. Elastic Hadoop Cluster
      Traditional Hadoop cluster
        • Easy to scale out
            • Fast-provision new Hadoop nodes and join them into the existing cluster
        • Hard to scale in
            while (ClusterIsTooLarge) {
              choose node k;
              kill(node k);
              wait(k's data blocks are recovered);
              if necessary, hadoop.rebalance();
            }
      Elastic Hadoop cluster
        Diagram labels: NN, JT; Normal node; Elastic node; TaskTracker; DataNode
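The scale-in loop on slide 22 can be made concrete with a small in-memory simulation. This Python sketch models a cluster as a mapping from node name to the set of block IDs it stores. It removes one node at a time and restores the replication factor before removing the next, which mirrors the "kill, then wait for recovery" steps. The function name, the victim-selection rule (fewest blocks) and the data model are all assumptions made for illustration; no Hadoop API is called.

```python
def scale_in(nodes, target_size, replication=2):
    """Shrink a simulated cluster to target_size nodes, one node at a time.

    nodes maps node name -> set of block ids stored on that node.
    """
    nodes = {name: set(blocks) for name, blocks in nodes.items()}
    while len(nodes) > target_size:                       # while (ClusterIsTooLarge)
        victim = min(nodes, key=lambda n: len(nodes[n]))  # choose node k
        lost = nodes.pop(victim)                          # kill(node k)
        for block in sorted(lost):                        # wait until k's blocks are recovered
            holders = [n for n in nodes if block in nodes[n]]
            if not holders:
                raise RuntimeError("no surviving replica of " + block)
            candidates = [n for n in nodes if block not in nodes[n]]
            while len(holders) < replication and candidates:
                # Copy to the emptiest node, which also keeps the cluster balanced.
                dest = min(candidates, key=lambda n: len(nodes[n]))
                nodes[dest].add(block)
                holders.append(dest)
                candidates.remove(dest)
    return nodes


cluster = {
    "node1": {"b1", "b2"},
    "node2": {"b1", "b3"},
    "node3": {"b2", "b4"},
    "node4": {"b3", "b4"},
}
smaller = scale_in(cluster, target_size=2)
for name in sorted(smaller):
    print(name, sorted(smaller[name]))
```

The simulation shows why scale-in is slow on a traditional cluster: every removed node forces its blocks to be copied elsewhere before the next node can go.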
  • 23. Replica Placement
      Second Replica
        • Different rack
        • Rack-awareness required
      Third Replica
        • Same rack, different physical host
        • Nodes share host (in virtualized environment)
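The placement rule on slide 23 can be sketched as a small selection function. This Python example assumes the first replica stays on the writing node and that "same rack" for the third replica means the rack of the second replica, as in HDFS's default policy; the slide does not spell out either point. The `Node` type, `place_replicas` and the sample inventory are hypothetical.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    name: str  # virtual machine running a DataNode
    host: str  # physical host the VM runs on
    rack: str


def place_replicas(writer: Node, nodes: List[Node]) -> List[Node]:
    """Pick three replica locations following the rule on slide 23."""
    first = writer  # assumption: first replica stays on the writing node
    second = next((n for n in nodes if n.rack != first.rack), None)
    if second is None:
        raise ValueError("no node on a different rack")
    third = next(
        (n for n in nodes if n.rack == second.rack and n.host != second.host),
        None,
    )
    if third is None:
        raise ValueError("no node on the same rack with a different physical host")
    return [first, second, third]


nodes = [
    Node("vm1", "hostA", "rack1"),
    Node("vm2", "hostA", "rack1"),
    Node("vm3", "hostB", "rack2"),
    Node("vm4", "hostB", "rack2"),
    Node("vm5", "hostC", "rack2"),
]
placement = place_replicas(nodes[0], nodes)
print([n.name for n in placement])
```

In this inventory `vm4` is skipped for the third replica because it shares `hostB` with `vm3`. A check on node identity alone, which suffices on physical clusters, would have accepted it.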
  • 25. Performance
      Create more, smaller VMs
        • Makes Hadoop scale better
        • Allows for easier/faster adjustment of packing of VMs across hosts by vSphere (including through DRS)
      Sizing/configuration of storage is critical
        • Plan on ~50 MB/sec of bandwidth per core
        • SANs are typically configured by default for IOPS, not bandwidth
        • Ensure SAN ports/switch topology allows required aggregate bandwidth
        • Performance of the backend storage should be tested/sized
        • Local disks will give ~100-140 MB/sec per disk: pick correct controller
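The sizing guidance on slide 25 lends itself to a quick calculation. The Python sketch below turns the two rules of thumb (about 50 MB/sec per core, about 100-140 MB/sec per local disk) into a disk-count estimate. The function name and the 12-core example, which matches the two hex-core CPUs of the test nodes, are illustrative choices.

```python
import math

MB_PER_SEC_PER_CORE = 50      # planning figure from slide 25
DISK_MB_PER_SEC = (100, 140)  # per-local-disk range from slide 25


def storage_plan(cores):
    """Estimate required bandwidth and local disk count for one host."""
    required = cores * MB_PER_SEC_PER_CORE
    low, high = DISK_MB_PER_SEC
    return {
        "required_mb_per_sec": required,
        "disks_at_low_rate": math.ceil(required / low),
        "disks_at_high_rate": math.ceil(required / high),
    }


print(storage_plan(12))
```

For 12 cores this gives 600 MB/sec, or 5 to 6 local disks, so the 10 data disks per node in the test cluster leave some headroom under these assumptions.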
  • 26. Summary
      • Hadoop does work well in a virtual environment
      • Plan a virtual cluster, enable other big-data solutions on the same infrastructure
      • Leverage the recipes to automate your configuration and deployment