2. Cloud Computing
Allows on-demand use of computing resources
Services are delivered over the network
Services are divided into:
Infrastructure-as-a-Service (IaaS)
Platform-as-a-Service (PaaS)
Software-as-a-Service (SaaS)
3. Defining a Cloud
Not a physical object, but an electronic structure
Servers behave as one large storage space and processor
Server clusters can provide a cloud setup
5. Popular Providers
Amazon Web Services
Salesforce.com
Microsoft Azure
Google App Engine
Hadoop
Manjrasoft Aneka
6. Amazon Web Services (AWS)
AWS is a collection of web services providing:
Compute power
Storage
Content delivery
Services available in the AWS ecosystem are:
Compute service
Storage service
Communication service
Additional services
7. Amazon EC2
Offers compute services and delivers IaaS
EC2 deploys servers as virtual machines
Signature features:
Amazon Machine Image (AMI)
EC2 instance and environment
AWS CloudFormation
AWS Elastic Beanstalk
8. Amazon Machine Image (AMI)
Templates used to create virtual machines
Contains:
Physical file system layout: Amazon Ramdisk Image (ARI)
Predefined OS installed: Amazon Kernel Image (AKI)
A created AMI is stored in an S3 bucket
A product code can be associated with an AMI to generate revenue
9. EC2 Instance
Represents a virtual machine
Created by selecting
No. of cores
Computing power
Installed memory
Currently available configurations
Standard instances
Micro instances
Cluster GPU instances
EC2 instances can be launched using
Command line tools
AWS console
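The selection step above (cores, computing power, installed memory) can be sketched as a simple instance-type chooser. The catalog below is hypothetical: the family names echo the slide's list, but the resource numbers are illustrative, not actual EC2 specifications.

```python
# Hypothetical catalog of EC2-style instance families, ordered smallest
# to largest. The numbers are illustrative, NOT real EC2 specifications.
CATALOG = {
    "micro":       {"cores": 1,  "memory_gib": 1,  "gpu": False},
    "standard":    {"cores": 4,  "memory_gib": 15, "gpu": False},
    "cluster_gpu": {"cores": 16, "memory_gib": 60, "gpu": True},
}

def pick_instance(cores, memory_gib, need_gpu=False):
    """Return the smallest family satisfying the requested resources."""
    for name, spec in CATALOG.items():
        if (spec["cores"] >= cores and spec["memory_gib"] >= memory_gib
                and (spec["gpu"] or not need_gpu)):
            return name
    raise ValueError("no instance type satisfies the request")
```

For example, `pick_instance(2, 8)` would select the `standard` family, since the `micro` entry lacks the requested cores and memory.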
10. EC2 Environment
EC2 instances are executed in a virtual environment
EC2 environment is in charge of
Allocating addresses
Attaching storage volumes
Configuring security
11. Amazon S3
Amazon Simple Storage Service (S3) is a distributed object store
S3 provides services for data storage and information management
Its components are:
Buckets
Objects
12. Amazon S3 vs. a Distributed File System
Storage is a flat, two-level hierarchy (buckets and objects), not a directory tree
Objects cannot be manipulated like standard files
Content is not immediately available to users
Requests will occasionally fail
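The bucket/object model above is essentially a flat key-value store. A minimal in-memory sketch (a toy model, not the real S3 API) shows why "folders" are only a naming convention built from key prefixes:

```python
class MiniS3:
    """Toy in-memory model of S3's two-level namespace (not the real API)."""

    def __init__(self):
        self.buckets = {}  # bucket name -> {object key -> bytes}

    def create_bucket(self, bucket):
        self.buckets[bucket] = {}

    def put_object(self, bucket, key, body):
        # Keys may contain '/', but there is no real directory tree.
        self.buckets[bucket][key] = body

    def get_object(self, bucket, key):
        return self.buckets[bucket][key]

    def list_objects(self, bucket, prefix=""):
        # "Folders" emerge only by filtering keys on a shared prefix.
        return sorted(k for k in self.buckets[bucket] if k.startswith(prefix))
```

Listing with `prefix="2008/01/"` returns only keys under that prefix, which is how S3-style stores simulate directories without having any.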
13. Features of S3
Resource Naming
Buckets
Objects and meta data
Access control and Security
Advanced features
14. Google App Engine
A PaaS implementation
Distributed and scalable runtime environment
Usage can be metered
17. Distributed Meme: Divide & Conquer
Specialized services
Memcache
URL Fetch
Mail
XMPP
Task Queue
Images
Datastore
Cron jobs
User Service
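Memcache, for example, gives applications a shared in-memory cache in front of the Datastore. A minimal sketch of the set-with-expiry / get pattern (a toy model, not the actual App Engine API), using a logical clock so expiry is deterministic:

```python
class MiniMemcache:
    """Toy cache with per-key expiry, driven by a logical clock (ticks)."""

    def __init__(self):
        self.store = {}  # key -> (value, expires_at_tick)
        self.tick = 0

    def set(self, key, value, ttl_ticks):
        self.store[key] = (value, self.tick + ttl_ticks)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None or entry[1] <= self.tick:
            return None  # cache miss: caller falls back to the Datastore
        return entry[0]

    def advance(self, ticks=1):
        self.tick += ticks
```

A `None` result signals a miss, at which point the application would re-read the authoritative copy from the Datastore and repopulate the cache.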
18. Hadoop
Software platform that lets one easily write and run
applications that process vast amounts of data.
It includes:
– MapReduce – offline computing engine
– HDFS – Hadoop distributed file system
19. What does it do?
Implements Google’s MapReduce, using HDFS
MapReduce divides applications into many small blocks
of work
HDFS creates multiple replicas of data blocks for reliability,
placing them on compute nodes around the cluster
MapReduce can then process the data where it is
located
Hadoop's target is to run on clusters on the order of
10,000 nodes
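The divide-into-small-blocks idea above can be illustrated with a single-process word-count sketch of the map, shuffle, and reduce phases (a simulation of the model, not Hadoop's Java API):

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) pairs for one block of input."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(blocks):
    pairs = []
    for block in blocks:  # in Hadoop, blocks run on many nodes in parallel
        pairs.extend(map_phase(block))
    return reduce_phase(shuffle(pairs))
```

Each input block is independent, which is what lets Hadoop schedule map tasks on whichever nodes already hold the data.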
20. What Hadoop provides:
Ability to read and write data in parallel to or from multiple
disks
Enables applications to work with thousands of nodes and
petabytes of data
A reliable shared storage and analysis system (HDFS and
MapReduce)
Advantages:
Scalable
Economical
Efficient
Reliable
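The reliability above comes from HDFS placing several replicas of each block on distinct nodes. A toy round-robin placement sketch (real HDFS placement is rack-aware and considerably more sophisticated):

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the replication factor")
    placement = {}
    for block in range(num_blocks):
        # Offsetting by the block index spreads replicas across the cluster.
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement
```

Because every block lives on multiple nodes, the loss of any single node leaves at least two copies of each of its blocks elsewhere in the cluster.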
Nutch's architecture wouldn't scale to index billions of pages. The paper about GFS provided the information needed to solve their storage needs for the very large files generated as part of the web crawl and indexing process; in particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. NDFS was an open-source implementation of GFS.
Google introduced MapReduce to the world, and by mid-2005 the Nutch project had developed an open-source implementation.
Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in February 2008, when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.
The NY Times used Amazon's EC2 compute cloud to crunch through 4 terabytes of scanned archives from the paper, converting them to PDFs for the Web. The processing took less than 24 hours to run using 100 machines, and the project probably wouldn't have been embarked on without the combination of Amazon's pay-by-the-hour model and Hadoop's easy-to-use parallel programming model.
Hadoop broke a world record to become the fastest system to sort a terabyte of data: running on a 910-node cluster, it sorted one terabyte in 209 seconds. In November of the same year, Google announced that its MapReduce implementation sorted one terabyte in 68 seconds. By 2009, Yahoo! used Hadoop to sort one terabyte in 62 seconds.
Yahoo! – 10,000-core Linux cluster
Facebook – claims to have the largest Hadoop cluster in the world, at 30 PB