4. 4#
My relationship with HA 2001
How many 9s can your product do?
Cloud Management #rightscale
5. 5#
So what did they mean by 5-9s?
Availability    Allowed Down Time each Year
99%             3.65 days
99.9%           8.76 hours
99.99%          52.56 minutes
99.999%         5.26 minutes
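The figures in the table can be reproduced with a few lines of arithmetic. The sketch below assumes a 365-day year (which is what the table's numbers imply); the function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, as the table implies


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


def describe(availability_percent: float) -> str:
    """Format the downtime budget in the largest unit that fits."""
    hours = allowed_downtime_hours(availability_percent)
    if hours >= 24:
        return f"{hours / 24:.2f} days"
    if hours >= 1:
        return f"{hours:.2f} hours"
    return f"{hours * 60:.2f} minutes"


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}%  {describe(pct)}")
```

Each extra 9 cuts the downtime budget by a factor of ten, which is why the jump from four to five 9s leaves only about five minutes a year.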
11. 11#
Golden Age of Cloud Computing
• No Up-Front Capital Expense
• Low Cost
• Pay Only for What You Use
• Self-Service Infrastructure
• Easily Scale Up and Down
• Improve Agility & Time-to-Market
Deploy
12. 12#
Golden Age for Fault-Tolerance
• No Up-Front HA Capital Expense
• Low Cost Backups
• Pay for DR Only When You Use it
• Self-Service DR Infrastructure
• Easily Deliver Fault-Tolerant Applications
• Improve Agility & Time-to-Recovery
Deploy
13. 13#
Yeah, but …
What about my private cloud?
Applications deployed in private clouds have to worry about:
• Private Cloud Infrastructure being HA
• Application architecture HA / DR
• With Public Clouds – Well, you get what your provider gives
you
14. 14#
Private Cloud Infrastructure HA
Several single points of failure in OpenStack deployment
• OpenStack API services
• MySQL
• RabbitMQ
Solved in various ways
• Pacemaker cluster management
• Keepalived (e.g., RAX Private Cloud)
• MySQL (Galera), RabbitMQ (active-active mirrored queues)
Eliminate SPoFs as best as you can.
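A rough way to see why removing single points of failure pays off is to compare availability with and without redundancy. The sketch below assumes component failures are independent, which real clusters only approximate, and the 99.9% per-node figure is a made-up illustration, not a measured value for any OpenStack service.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant(availability: float, copies: int) -> float:
    """Availability of `copies` independent replicas where one is enough."""
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    # Hypothetical per-node availabilities, chosen only for illustration.
    api, mysql, rabbitmq = 0.999, 0.999, 0.999

    single = serial(api, mysql, rabbitmq)
    clustered = serial(
        redundant(api, 2), redundant(mysql, 2), redundant(rabbitmq, 2)
    )
    print(f"one node per service:  {single:.6f}")
    print(f"two nodes per service: {clustered:.6f}")
```

Under these assumptions, chaining three single nodes gives lower availability than any one of them, while doubling each service pushes the chain well above the single-node figure.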
15. 15#
What about my app?
Design for failure:
• If your application relies on Cloud infrastructure
SLA for its HA needs, you are STUCK with that
vendor / infrastructure
• Need to balance cost and complexity against risk
tolerance
• Design the application to survive failure at each level:
  • Build for server failure
  • Build for zone failure
  • Build for cloud failure
• Keep the management layer separate from infrastructure
16. 16#
Build for Server Failure
• Set up auto-scaling
• Set up database mirroring, master/slave configuration
• Use static public IPs
• Use Dynamic DNS for private IPs
17. 17#
Build for Zone Failure
[Diagram: DNS maps static public IPs 172.168.7.31 and 172.168.8.62 to load balancers in Zone 1 and Zone 2. App servers autoscale behind the load balancers. The master DB replicates to the slave DB. Block snapshots go to the object store.]
• Where possible, use a NoSQL DB like Cassandra or MongoDB
• Snapshot the data volume for backups so the database can be readily recovered within the region.
• Place slave databases in one or more zones for failover.
A creative deployment model would be to make your private cloud an “AZ” by placing it in close physical proximity to a public cloud provider.
18. 18#
Build for Cloud Failure (Cold DR)
Staged Server Configuration and generally no staged data
$
• Not recommended if rapid recovery is required
• Slow to replicate data to other cloud and bring database online
[Diagram: DNS (172.168.7.31); Private and DALLAS deployments, each with load balancers and app servers; master DB and slave DBs with replication; block snapshots stored in Cloud Files.]
19. 19#
Build for Cloud Failure (Warm DR)
Staged Server Configuration, pre-staged data and running Slave Database Server
$$
• Generally recommended DR solution
• Minimal additional cost and allows fairly rapid recovery
[Diagram: DNS (172.168.7.31); Private and DALLAS deployments, each with load balancers and app servers; master DB and slave DBs with two replication links; two sets of block snapshots stored in Cloud Files.]
20. 20#
Build for Cloud Failure (Hot DR)
Parallel Deployment with all servers running but all traffic going to primary
$$$
• Not recommended
• Very high additional cost to allow rapid recovery
[Diagram: DNS (172.168.7.31); Private and DALLAS deployments, each with load balancers and app servers; master DB and slave DBs with two replication links; two sets of block snapshots stored in Cloud Files.]
23. 23#
Automate and test everything
• Automate backups of your data
• Set up monitoring and alerts
• Run fire-drills! Plan and Practice your recovery procedures!
24. 24#
Separate Management layer from Infrastructure
• Keep the keys to the car outside the car
25. 25#
Automating HA and DR
• Use dynamic DNS for your database servers
  • Allow app servers to use a single FQDN.
  • Use a low TTL to allow rapid failover in the case of a change in master database
• Automatic connection of app servers to load balancing servers
  • App servers can connect to all load balancers automatically at launch
  • No manual intervention
  • No DNS modifications
• Automated promotion of slave to master
  • Process is automated
  • Decision to run process is manual
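The dynamic DNS and promotion points above can be sketched as a small simulation. Everything here is a hypothetical stand-in: `DynamicDNS`, `AppServer`, and `promote_slave` are illustrative names rather than any provider's or RightScale's API, the IP addresses are made up, and time is passed in explicitly so the TTL behaviour is easy to follow. Only the FQDN `mymaster.mydomain.com` comes from the speaker notes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DynamicDNS:
    """In-memory stand-in for a dynamic DNS provider (hypothetical)."""
    ttl_seconds: int
    records: Dict[str, str] = field(default_factory=dict)

    def update(self, fqdn: str, ip: str) -> None:
        self.records[fqdn] = ip

    def resolve(self, fqdn: str) -> str:
        return self.records[fqdn]


@dataclass
class AppServer:
    """Caches the master's address and re-resolves once the TTL expires."""
    dns: DynamicDNS
    db_fqdn: str
    cached_ip: Optional[str] = None
    expires_at: float = 0.0

    def db_address(self, now: float) -> str:
        if self.cached_ip is None or now >= self.expires_at:
            self.cached_ip = self.dns.resolve(self.db_fqdn)
            self.expires_at = now + self.dns.ttl_seconds
        return self.cached_ip


def promote_slave(dns: DynamicDNS, fqdn: str, slave_ip: str,
                  operator_confirmed: bool) -> bool:
    """Run the automated promotion steps only after a human says go."""
    if not operator_confirmed:
        return False
    # A real runbook would also stop replication and make the slave writable.
    dns.update(fqdn, slave_ip)
    return True


if __name__ == "__main__":
    dns = DynamicDNS(ttl_seconds=60)
    dns.update("mymaster.mydomain.com", "10.0.0.10")   # made-up master IP
    app = AppServer(dns, "mymaster.mydomain.com")

    print(app.db_address(now=0))    # 10.0.0.10
    promote_slave(dns, "mymaster.mydomain.com", "10.0.0.20",
                  operator_confirmed=True)
    print(app.db_address(now=30))   # 10.0.0.10 (cached until the TTL expires)
    print(app.db_address(now=60))   # 10.0.0.20
```

The app server never needs a configuration change: it keeps using the same FQDN, and the lower the TTL, the sooner it picks up the promoted database. Keeping `operator_confirmed` as an explicit argument mirrors the split between an automated process and a manual decision to run it.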
28. 28#
How RightScale makes it possible
RightScale ServerTemplates™
• Reproducible: Predictable deployment
• Dynamic: Configuration from scripts at boot time
• Multi-cloud: Cloud agnostic and portable
• Modular: Role and behavior abstracted from cloud infrastructure
29. 29#
How RightScale makes it possible
MultiCloud Images
• MultiCloud Images can be launched across regions and clouds without modification
1. ServerTemplate contains a list of MultiCloud Images (MCIs)
2. When the Server is created, a specific MCI is chosen.
3. The appropriate RightImage is used at launch.
[Diagram: MultiCloud Images pairing Cloud A, Cloud B, and Cloud C with RightImages (Image 1, Image 2). Caption: Stability across clouds.]
30. 30#
Outage-Proofing Best Practices
• Place in >1 zone:
  • Load balancers
  • App servers
  • Databases
• Maintain capacity to absorb zone or region failures
• Replicate data across zones
• Back up across regions & clouds
• Monitor, alert, and automate operations to speed up failover
• Design stateless apps for resilience to reboot / relaunch
31. 31#
Thank you!
Sign up for a free account at: www.rightscale.com
Check out job postings at: www.rightscale.com/jobs
We are hiring!
Editor's notes
Good afternoon, folks. I hope you are here for the high availability discussion. In case of an emergency, we have specially arranged a highly available pair of exits to your left and behind you. So, let me tell you a bit about myself and what HA means to me. I am a product manager at RightScale.
My relationship with HA goes back all the way to my kindergarten years, growing up in India. Going to my first big kindergarten exam, I recall worrying about having more than one sharpened pencil in my pencil box, ready to go. And yes, kindergarteners have exams in India, but that’s an entirely different discussion.
Fast forward to my college days, taking my first big flight on a 747 to California. Yes, you guessed it: I worried about the plane having enough engines so that if one of them failed, I wouldn’t become fish food in the Pacific Ocean.
Fast forward a few more years to my telecommunication days, visiting KDDI and NTT DoCoMo in Japan for discussions on our messaging product. They pretty much immediately got to the topic of “how many 9s does your product do?” And anything less than 5-9s would not have been an acceptable answer in the heavily regulated Japanese telecommunication market.
Quick definition of how the “9s” availability translates to allowed downtime each year
Leap forward to 2012: the cloud era is in full swing. Behemoth cloud providers are stamping out VMs like Oreo cookies while preaching the mantra “everything fails all the time”. And rightfully so: in 2012, we saw 27 sizable outages across public, private, hosting, and SaaS providers.
Infographic (not restricted to cloud computing only):
• 7 major cloud outages in 2012
• The average company has 1 major and 3 minor DC outages per year
• $5k per minute of downtime (average cost)
• Outages are becoming more and more public as more people get on the cloud
• May of 2010
• First big one that happened was in
• April 2011: lot of people that got a lot of press
Among the top five causes of outages were power loss, natural disasters, software bugs that cascaded, and operator errors. Even though large-scale outages are rare, they do happen and will continue to happen in the future.
In the aftermath of outages, you see these. Outages are expensive: there is nothing more frustrating to a modern-day consumer than going to a website and seeing it’s down. Every minute of downtime affects your revenue and your brand reputation. Computer Associates did a study last year that put the cost of outages at about $26 billion a year. Cost of
We are in the golden age of cloud computing.
At the end of the day, you are responsible for the HA of your application. Cloud infrastructure provides tools. Relying on cloud infrastructure for HA is a recipe for trouble, as this locks you into that cloud infrastructure. You need portability, so when you move your application to another cloud, it stands on its own merit. Weigh the complexity of HA against the risk, as you would with auto and home insurance. The cost of HA goes up exponentially as you reduce your tolerance for downtime (Recovery Time Objective) as well as your tolerance for data loss (Recovery Point Objective).
This is what we generally recommend when someone comes to us and says “I want HA”:
• Three-tiered application
• RR DNS Load Balancers
• Array of application servers
• Master-slave databases
• At least one of each component in each AZ
Place the slave database in a different zone, so if one of the zones were to go down, you will not have an outage. Granted, there will be some performance degradation.
During emergencies, time is precious – make sure it works
If both go down, you have nowhere to go. If the disaster hits management, you still have the app; if the disaster hits the app, you can execute on DR scenarios.
Which parts should you automate and which parts shouldn’t you? We always recommend using dynamic DNS for your DB servers. This allows app servers to use a single FQDN (e.g., mymaster.mydomain.com) that can be resolved by the dynamic DNS. So in case of a failover, dynamic DNS gets updated automatically and the servers will discover the new DB once the TTL expires. Use a low TTL. We recommend automating the process of connecting app servers to LBs, so when a new app server fires up, it automatically registers itself with the load balancer without manual intervention. For promotion of slave to master, the process is automated; the decision to run the process is manual. Once you have pushed that button, there is no going back, so make sure you are certain before you fail over. The promotion happened in case where the master wasn’t really down but it resulted
I am representing RightScale today, so a little bit on how RightScale can help. ServerTemplates allow you to pre-configure servers by starting from a base image and adding scripts that run during the boot, operational, and shutdown phases of a server instance. The key benefit of a ServerTemplate is that it helps you create an easily reproducible server setup, and this can be done across multiple clouds. Through the server configuration mechanism that is built into ServerTemplates, the servers have the ability to automatically join load balancer pools, autoscale across zones, etc.
I am representing RightScale today, so a little bit on how RightScale can help. A ServerTemplate contains a list of MultiCloud Images. When a server is created, Quickly, efficiently and repeatably