1. Is Cloud a Right Companion for Hadoop?
Saravanan Prabhagaran & Chintan Bhatt
2. Agenda
Hadoop and cloud primer
Hadoop challenges
Hadoop on cloud - advantages and challenges
Hadoop on cloud offerings
Typical use cases of Hadoop on cloud
Considerations for Hadoop deployment
Conclusion
5. Hadoop on Cloud Advantages
Hadoop: distributed / parallel processing, higher throughput, efficient resource utilization
Cloud: on-demand, faster provisioning, lower operations cost, elasticity, agility, pay as you go
6. Hadoop on Cloud Challenges
Data locality vs on-demand cloud
Import / export parallelism, interruption
Applications sensitive to latencies experience higher overheads
Higher overheads on a VM than on bare metal
7. Hadoop and Cloud-Based Offerings
Cloud providers: AWS, MS Azure, Rackspace, Joyent, GoGrid
Hadoop distributions: Cloudera, Hortonworks, MapR, Pivotal
Offerings: EMR, HDInsight, Rackspace Cloud Big Data platform, Project Sahara
8. Project Sahara
OpenStack component
Hadoop cluster provisioning
REST API
Integration with Hadoop management tools
Hadoop configuration templates
Use cases:
Faster cluster provisioning for DEV/QA
Ad-hoc or bursty analytical workloads
Resource utilization from OpenStack IaaS
10. Typical Use Cases of Hadoop on Cloud
On-demand analytics
Dev/QA or POC environment
Cluster required for executing nightly, weekly, monthly jobs
Application deployed on cloud and the data to be used is in cloud
11. Considerations - Hadoop on Bare Metal or on Cloud
CapEx vs OpEx
Performance vs price-performance
Data gravity
Regulatory requirements
Agility
12. Hadoop on Cloud?
Performance critical? Yes → bare metal
Data gravity: in cloud? → public cloud / hosted Hadoop; in-premise? → in-premise bare metal or private cloud
Control over data: strict control? → in-premise bare metal or private cloud
13. Hadoop on Cloud?
POC/DEV/QA? → public cloud / hosted Hadoop, or in-premise private cloud
CapEx vs OpEx: CapEx? → in-premise bare metal or private cloud; OpEx? → public cloud / hosted Hadoop
Hi, good afternoon everyone. My name is Chintan Bhatt and I have with me Saravanan. We both work in Syntel's Big Data practice. Today we want to talk about two of the most popular technologies: Hadoop, a large-scale distributed processing infrastructure, and cloud computing, known for its agility, elasticity, and economy. In our session today we intend to discuss the need for and challenges of Hadoop on cloud.
So, the agenda of the session is as follows:
First we will start with a basic 101 of Hadoop and cloud computing, in which we will highlight the important features of both technologies.
Then we look at some of the challenges of Hadoop alone.
After that we showcase the advantages of Hadoop on cloud.
Then we explore some of the popular offerings of Hadoop on cloud.
And we conclude with considerations for when and where Hadoop should be deployed.
Now I would like to invite Saravanan to start the session.
Do the features of Hadoop or cloud get hampered by the integration?
Hadoop achieves parallelism through data locality, while the cloud is built for on-demand access; how practical is it to import and export enormous amounts of data just for the sake of data locality?
Amazon itself notes that by using S3 as an input to MapReduce you lose the data locality optimization, which may be significant.
32 cores running 32 VM instances may produce much larger overheads than 16 cores running 16 VMs.
Research indicates that applications that are more sensitive to latencies experience higher overheads under virtualized resources, and this overhead increases as more VMs are deployed per hardware node. These characteristics have a larger impact on systems with more CPU cores per node.
Thanks, Saravanan. Saravanan has pointed out the underlying features of Hadoop and cloud, how Hadoop can benefit from the cloud, and what the challenges are. Let's now look at some of the offerings of Hadoop on cloud.
We have the most popular Hadoop and cloud vendors, and some of the offerings that come from partnerships between them.
One of the most popular offerings is Amazon's Elastic MapReduce (EMR), a Hadoop-based web service for processing data stored on the Amazon cloud. It supports various applications like MapReduce, Pig, Hive, and Cascading for processing data stored in Amazon S3. Amazon also offers the service with MapR or Cloudera as the underlying Hadoop distribution.
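As a rough illustration of what launching such an ephemeral cluster looks like, here is a sketch using boto3, the AWS SDK for Python. The cluster name, sizes, release label, and S3 paths are illustrative assumptions, not values from this talk:

```python
# Sketch: parameters for an ephemeral EMR cluster via boto3's run_job_flow.
# All names, sizes, and paths below are illustrative assumptions.

def build_emr_request(name, log_uri, num_workers):
    """Build the parameter dict that boto3's emr.run_job_flow() expects."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-5.36.0",  # example release; pick one your region offers
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Pig"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 1 + num_workers,  # one master plus the workers
            # Tear the cluster down automatically once all steps finish:
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

params = build_emr_request("nightly-analytics", "s3://my-bucket/emr-logs/", num_workers=4)
print(params["Instances"]["InstanceCount"])  # prints 5

# Submitting it would then be (requires AWS credentials):
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   response = emr.run_job_flow(**params)
```

The auto-terminate flag is what makes the "provision, process, tear down" pattern discussed later practical for batch workloads.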
A similar offering from Microsoft is HDInsight, which is Hadoop as a service on Microsoft Azure, based on HDP. It uses Azure Blob storage, a storage service similar to S3. The advantage is that it integrates easily with Microsoft-based infrastructure and .NET applications.
Rackspace offers Hadoop in different flavors through a partnership with Hortonworks: managed hosting of optimized HDP clusters; the public-cloud Cloud Big Data platform, which allows users to deploy, test, and query Hadoop without acquiring any hardware or signing any contract; and private cloud and hybrid offerings with RackConnect.
There are similar offerings by other cloud providers as well.
Another project, from the open-source community in the OpenStack ecosystem, is Project Sahara, which we will look at in some detail.
Project Sahara, previously known as Project Savanna, is a joint effort by Rackspace, Mirantis, and Hortonworks to let users quickly and easily provision and manage Hadoop clusters on the OpenStack platform.
It is a native OpenStack component that provides a REST-based API to provision a Hadoop cluster, given details such as the number of nodes and the Hadoop version. It also integrates with Hadoop management tools like Apache Ambari to manage the cluster. And it makes cluster configuration easier through cluster-wide configuration templates for a particular Hadoop distribution, which can be created once and deployed multiple times.
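To make the REST flow concrete, here is a hedged sketch of a cluster-creation payload in the shape of Sahara's v1.1 API of that era. The field names follow the Sahara documentation as I recall it, and the endpoint, plugin, version, and UUIDs are placeholders, so treat this as an assumption-laden illustration rather than a verified request:

```python
import json

# Hypothetical Sahara cluster-creation payload (v1.1-style API).
# The plugin name, Hadoop version, and UUIDs are placeholders.
payload = {
    "name": "demo-cluster",
    "plugin_name": "vanilla",                   # the plain Apache Hadoop plugin
    "hadoop_version": "2.7.1",
    "cluster_template_id": "<template-uuid>",   # a previously created template
    "default_image_id": "<glance-image-uuid>",  # pre-built OS+Hadoop image in Glance
}
body = json.dumps(payload)

# The request itself would look roughly like (needs a Keystone token):
#   POST http://<sahara-host>:8386/v1.1/<project-id>/clusters
#   headers: {"X-Auth-Token": token, "Content-Type": "application/json"}
print(body)
```

The template-then-instantiate split is the point of the sketch: the template is created once and the short payload above can be replayed many times.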
Data operations allow users to be insulated from cluster-creation tasks and work directly with data. That is, users specify a job to run and a data source, and Sahara takes care of the entire job and cluster lifecycle: bringing up the cluster, configuring the job and data source, running the job, and shutting down the cluster. Sahara supports different types of jobs: MapReduce jobs, Hive queries, Pig jobs, and Oozie workflows. The data can come from various sources: Swift, remote HDFS, and NoSQL and SQL databases.
Some of the use cases targeted by Sahara are faster cluster provisioning for Dev/QA or POC requirements; ad-hoc or bursty workloads similar to EMR, which insulate the user from the cluster lifecycle; and the ability to utilize resources from OpenStack and provide OpenStack users with large-scale data computation capabilities.
This is a high-level architecture of Sahara in the OpenStack ecosystem.
Here the user can either use the Horizon user interface to create, configure, and manage a Hadoop cluster, or use the REST API interface directly to perform similar operations.
The Keystone component authenticates the user to perform operations on OpenStack. Nova, OpenStack's compute fabric, is used to create the Hadoop virtual machines. A pre-configured image with the OS and Hadoop software is stored in Glance, which makes node start-up quick. And finally, the OpenStack storage service Swift can be used as persistent storage for the Hadoop cluster, storing the input and output of Hadoop processing jobs. To let Hadoop access Swift as a file system, JIRA HADOOP-8545 was created.
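The Swift integration from HADOOP-8545 shipped as the hadoop-openstack module and is wired up through `core-site.xml`. The fragment below is a minimal sketch; the service name "sahara" and the Keystone endpoint URL are placeholder assumptions:

```xml
<!-- core-site.xml fragment: exposing Swift as a Hadoop filesystem
     (hadoop-openstack module, HADOOP-8545). The service name "sahara"
     and the Keystone URL below are placeholders. -->
<property>
  <name>fs.swift.impl</name>
  <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
</property>
<property>
  <name>fs.swift.service.sahara.auth.url</name>
  <value>http://keystone-host:5000/v2.0/tokens</value>
</property>
<!-- Job input/output is then addressed as swift://container.sahara/path -->
```

With this in place, a MapReduce job can read its input from and write its results to Swift URLs instead of HDFS paths.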
After looking at the cloud-based offerings and the advantages of Hadoop on cloud, let's look at some use cases where Hadoop on cloud makes the most sense.
The first use case is on-demand analytics, in which analytics is provided as a service implementing a particular use case, e.g. customer churn, recommendations, or clustering. The user is required to upload the data to the cloud, public or private. This is an ideal use case for Hadoop on cloud: the cluster can be provisioned based on the size of the data and the computational intensity, the job executed, and once the result is returned, the cluster torn down.
Another obvious use case is faster provisioning of clusters for POC, Dev, and QA environments, which can be quickly released.
Most Hadoop jobs do not run in near real time, but at different frequencies: nightly, weekly, or monthly. In such scenarios it does not make sense to keep the cluster occupied while no job is running. Instead, it's fine to run the job on the data set, get the result, and tear down the cluster again.
Because of the popularity of the cloud, many applications are deployed there and generate a lot of data. As this data is generated in the cloud, bringing the processing closer to the data is ideal, since it avoids transferring the data to an in-house deployment. Clickstream data is a good example of data that originates in the cloud.
To summarize the presentation: while deciding on a Hadoop deployment mode, the following criteria need to be considered.
CapEx vs OpEx: this is a generic cloud-computing deployment criterion for any application, but it is equally applicable to Hadoop, since Hadoop requires an upfront investment in infrastructure.
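The CapEx/OpEx trade-off can be framed as a simple break-even calculation. The sketch below uses made-up illustrative prices (they are not figures from this talk) to show the kind of arithmetic involved:

```python
# Toy break-even comparison: buying cluster hardware (CapEx) vs paying
# per use in the cloud (OpEx). All prices are made-up illustrations.

def break_even_hours(capex, on_prem_hourly_ops, cloud_hourly_rate):
    """Hours of cluster usage at which owning becomes cheaper than renting."""
    # Owning costs capex + ops*h; renting costs rate*h.
    # Solving capex + ops*h = rate*h gives h = capex / (rate - ops).
    if cloud_hourly_rate <= on_prem_hourly_ops:
        return float("inf")  # cloud is always cheaper per hour of use
    return capex / (cloud_hourly_rate - on_prem_hourly_ops)

# Hypothetical numbers: $100k of hardware, $5/hr to operate it,
# vs an equivalent cloud cluster at $30/hr.
hours = break_even_hours(capex=100_000, on_prem_hourly_ops=5.0, cloud_hourly_rate=30.0)
print(round(hours))  # prints 4000
```

Under these assumed prices, a cluster busy fewer than 4000 hours favors pay-as-you-go, while a continuously loaded cluster amortizes the purchase quickly, which is exactly the intuition behind the criterion above.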
The other rather important point is performance vs price-performance. Mission-critical applications that must strictly meet SLAs will have performance as the highest priority, while some applications or organizations prioritize price-performance and weigh the cost of achieving that performance.
Data gravity is about where the data is generated and keeping the data-processing ecosystem close to it. This means considering whether the data is generated in-premise by internal applications or generated in the cloud.
Some organizations have regulatory requirements that may not allow sensitive data to leave the organization's boundary.
And the final one is agility: for applications like proofs of concept, what's required is to quickly set up the environment, develop, and validate the concept.
Based on these criteria and our experience, we can see what approach can be taken for deployment.
As mentioned before, if performance is the higher priority, bare-metal deployment is preferable: Hadoop gets exclusive access to the physical hardware, configured and tuned for a specific application. That's how we have been helping our clients with production deployments.
Considering where the data is generated: if it is generated in the cloud, it's best to do the processing in the cloud, whether on a public cloud or a hosted Hadoop service. For example, we built a solution for a retail customer where we needed to scrape prices from competitors' and the client's own websites, and give them an optimized price for their products along with other analytics on that data. Here it made more sense to deploy the solution in the cloud rather than in-house, which would have involved transferring data from the internet to the premises.
For data generated in-premise, and where the data cannot leave the premises, either bare metal or a private cloud within the organization can be a suitable solution.
Application development with a shorter lifecycle, like a POC, or the need to quickly provision and release resources for Dev and QA, are also ideal candidates for the cloud, which provides agility.
And primary factors like CapEx and OpEx can rightly influence the choice between an upfront investment in infrastructure that you then leverage, and eliminating CapEx in the cloud by paying per use.