Disaster Recovery on Demand on the Cloud

Protect your app from Outages
Nati Shalom, CTO, GigaSpaces
@natishalom
May 2013

AGENDA
• AWS and outages
• Outage impact
• Disaster Recovery – it’s all about redundancy!
• Cloudify as a solution for redundancy
• Demo with Cloudify on EC2
® Copyright 2013 GigaSpaces Ltd. All Rights Reserved

AWS USAGE
• AWS – around 0.5M servers
• Facebook – less than 0.1M servers
• Google – around 1M servers

OUTAGE – APRIL 21, 2011

OUTAGE - JUNE 29, 2012

OUTAGE - OCTOBER 22, 2012

OUTAGE - CHRISTMAS EVE 2012

NOT ONLY AMAZON
• 28 December 2012 - Some owners of Microsoft's Xbox 360 gaming console were unable to access some of their cloud-based storage files.
• 26 July 2012 - Service for Microsoft’s Windows Azure Europe region went down for more than two hours.
• 29 February 2012 - The ultimate result was service impacts of 8-10 hours for users of Azure data centers in Dublin, Ireland; Chicago; and San Antonio.

THAT’S WHAT YOU EXPECT?
99% - 3.65 days downtime per year
99.9% - 8.76 hours downtime per year
99.99% - 52.56 minutes downtime per year
99.999% - 5.26 minutes downtime per year
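
The downtime figures follow from simple arithmetic on a 365-day year. A minimal sketch of that calculation (the function name is illustrative, not from the slides):

```python
# Illustrative helper: yearly downtime allowed by an availability target.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours per year a service may be down at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target}% -> {hours:.2f} hours ({hours * 60:.2f} minutes) per year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten.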

OUTAGE IMPACT – DESIGN FOR FAILURES
Outage could cost…
$89K per hour for Amadeus
$225K per hour for PayPal!

PREPARE FOR DISASTER RECOVERY
• Dedicated expert for DR architecture
• Define target recovery time and recovery point (RTO and RPO)
• Assume every tier can fail
• Use monitoring and alerts
• Document your operational processes
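
Recovery time and recovery point targets are only useful if they are checked against measurements. A minimal sketch of such a check, assuming the recovery time comes from a DR drill and the replication lag from monitoring (all names here are hypothetical):

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RecoveryTargets:
    rto_seconds: float  # recovery time objective: longest acceptable time to restore service
    rpo_seconds: float  # recovery point objective: largest acceptable window of lost data


def violated_targets(
    targets: RecoveryTargets,
    measured_recovery_seconds: float,
    replication_lag_seconds: float,
) -> List[str]:
    """Return the names of the targets that the measurements exceed."""
    violated = []
    if measured_recovery_seconds > targets.rto_seconds:
        violated.append("RTO")
    if replication_lag_seconds > targets.rpo_seconds:
        violated.append("RPO")
    return violated
```

A non-empty result is a natural trigger for an alert.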

Leverage Existing Automation Frameworks
Configuration Centric APP Centric (PaaS)

CLONE YOUR ENV - HOW DOES IT WORK?

BUILT IN SUPPORT FOR MANAGING DATA IN THE CLOUD
• Real Time: Storm
• Elastic Caching: XAP
• Relational DB Clusters: MySQL, PostgreSQL
• NoSQL Clusters: MongoDB, Cassandra, Couchbase, ElasticSearch
• Hadoop: Hadoop (Hive, Pig, ...), ZooKeeper

CASE STUDY: VERIFI
• Technology-based concrete process control and information service
• Deployments across North America, Latin America, Asia, and Europe for nearly a decade
• Part of W.R. Grace & Co., a $6.3B company
• The problem: On-Demand HA/DR over multiple Cloud regions
High Availability | Data Replication | Disaster Recovery

ELASTIC ON-DEMAND DISASTER RECOVERY
• Problem
  • Can we eliminate the RTO vs. cost trade-off in the cloud?
• Solution (Elastic DR)
  • A hybrid between Hot and Warm DR
  • Switch to the active site in a matter of seconds through cloud-agnostic lifecycle automation recipes

VERIFI (INITIAL) ARCHITECTURE
[Diagram] Availability region (US-West: Oregon), 4 recipes:
• Internet-facing EC2 instance: mod_cluster
• EC2 instance: JBoss
• EC2 instance: PostgreSQL, with data volume
• EC2 instance: Cassandra, with data volume

ELASTIC DR ON-DEMAND: FAILOVER SCENARIO
[Diagram]
• Cloud #1: Region (US-West Oregon), App Servers and PostgreSQL
• Cloud #2: Region (US-East Virginia), PostgreSQL
• Cloud #3: Region (US-West California), PostgreSQL

Sequence:
• Upon initial deployment, the primary deployment of the application “verifi” is bootstrapped onto cloud #1. Another, slightly modified application recipe, “verifi_dr”, is bootstrapped as cloud #2; it polls cloud #1 for failure (liveness poll) and acts as a PostgreSQL DB slave.
• A region failure occurs.
• On cloud #2: turn the Postgres slave into the master and start app server instances.*
• Bootstrap another cloud in a different region using the same application recipe used to bootstrap cloud #2 above.*

* Initially, all those actions may be done manually by Verifi’s Ops team (e.g., via recipe commands in the CLI).
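
The poll-then-promote flow in this scenario can be sketched in a few lines. This is an illustration only, not Cloudify recipe code: the three callables stand in for whatever the DR site actually uses to check cloud #1, promote the PostgreSQL slave, and start the app servers, and the thresholds are assumptions.

```python
import time
from typing import Callable


def watch_and_failover(
    primary_is_alive: Callable[[], bool],
    promote_replica: Callable[[], None],
    start_app_servers: Callable[[], None],
    max_missed_polls: int = 3,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll the primary site; after enough consecutive misses, activate the DR site."""
    missed = 0
    while missed < max_missed_polls:
        # A successful poll resets the counter, so a single blip does not trigger failover.
        missed = 0 if primary_is_alive() else missed + 1
        if missed < max_missed_polls:
            sleep(poll_interval_s)
    promote_replica()    # turn the PostgreSQL slave into the master
    start_app_servers()  # start app server instances on the DR site
```

Requiring several consecutive missed polls trades a slightly longer detection time for fewer false failovers.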

NEXT STEPS
Availability levels:
• Across clouds (AWS, Rackspace, Azure, etc.): 1 application + overrides, several cloud drivers
• Across AWS regions: 1 application + overrides, 1 cloud driver
• Across AWS zones: 1 application + overrides, 1 cloud driver
Supported by Verifi phase #1
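
The "1 application + overrides" idea can be illustrated as one base definition plus a small set of per-site changes. A minimal sketch; the key names and values are hypothetical and this is not Cloudify recipe syntax:

```python
from typing import Any, Dict

# Hypothetical base settings for the primary site.
BASE_APP: Dict[str, Any] = {
    "name": "verifi",
    "cloud_driver": "ec2",
    "region": "us-west-2",  # AWS region ID for US-West (Oregon)
    "app_servers": 2,
    "postgres_role": "master",
}

# Hypothetical overrides for the DR site in US-East (Virginia).
DR_OVERRIDES: Dict[str, Any] = {
    "name": "verifi_dr",
    "region": "us-east-1",
    "app_servers": 0,  # app servers start on demand upon failover
    "postgres_role": "slave",
}


def apply_overrides(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return new settings; the base definition is left untouched."""
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"unknown override keys: {sorted(unknown)}")
    return {**base, **overrides}
```

Rejecting unknown keys catches typos in an override before they silently produce a site that differs from the primary.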

ELASTIC ON-DEMAND DR: COSTS
Site                      Cost
Main Site (US-West)       $82,068
Warm DR Site (US-East)    $12,625
Hot DR Site               $82,068

• Main Site: 1 load balancer, 2 JBoss instances, 1 PostgreSQL master, 3 Cassandra
• DR Site: 1 PostgreSQL slave; all other instances start on demand upon failover
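
The cost figures on this slide can be compared directly. A minimal sketch of the arithmetic, using only the three numbers from the slide (the function names are illustrative):

```python
MAIN_SITE_COST = 82_068  # from the slide
WARM_DR_COST = 12_625    # from the slide
HOT_DR_COST = 82_068     # from the slide


def dr_overhead(main_cost: int, dr_cost: int) -> float:
    """DR site cost as a fraction of the main site cost."""
    return dr_cost / main_cost


def savings_vs_hot(warm_cost: int, hot_cost: int) -> int:
    """Amount saved by running a warm DR site instead of a hot one."""
    return hot_cost - warm_cost


if __name__ == "__main__":
    print(f"Warm DR overhead: {dr_overhead(MAIN_SITE_COST, WARM_DR_COST):.1%}")
    print(f"Hot DR overhead:  {dr_overhead(MAIN_SITE_COST, HOT_DR_COST):.1%}")
    print(f"Savings vs. hot:  ${savings_vs_hot(WARM_DR_COST, HOT_DR_COST):,}")
```

On these figures the warm DR site adds roughly 15% to the main site cost, where a hot site doubles it.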

ELASTIC DR: WARM DR COST, CLOUD PORTABILITY
[Chart] DR site (4 recipes): $12k
Same recipe: $14k, $6k, $5k, $9k

ELASTIC DR: HOT DR COST
[Chart] DR site (4 recipes): $82k
Same recipe: $79k, $115k, $68k, $91k

SUMMARY
• Disaster Recovery – it’s all about redundancy!
  • Cloning your environment – app stack
  • Cloning your data – DB replication
• Automation makes DR processes simple
  • Use recipes to clone your app stack consistently
  • Use replication to clone your data
• Leverage cloud economics to reduce the cost
  • DR on Demand
  • Multi Cloud

Thank You!
@natishalom
QUESTIONS & ANSWERS
