This presentation gives an overview of disaster recovery (DR) concepts and of the AWS features that support DR. It covers architectural patterns for DR on AWS, ranging from simple backup and restore to fully redundant multi-site configurations. Example patterns include backing up to S3, running a "pilot light" of reduced infrastructure for quick recovery, and running a fully working low-capacity standby environment. The presentation concludes with solutions providers and the advantages of using AWS for disaster recovery planning.
2. Introduction Attila Narin, Sr. Manager, Solutions Architecture, EMEA Based in Luxembourg At Amazon for almost 7 years About 4.5 years at AWS Was member of EC2 Team before moving to Solutions Architecture
3. Office Hours IS Simply, Office Hours is a program that enables a technical audience to interact with AWS technical experts. We look to improve this program by soliciting feedback from Office Hours attendees. Please let us know what you would like to see.
4. Office Hours is NOT Support: if you have a problem with your current deployment, please visit our forums or our support website http://aws.amazon.com/premiumsupport/ A place to find out about upcoming services: we do not typically disclose information about our services until they are available.
5. Agenda Disaster Recovery (DR) Concepts AWS Features for DR Example Architectural Patterns for DR Solutions Providers for Backup and DR Question and Answer Please begin submitting questions now
7. Disaster Recovery Overview What is Disaster Recovery (DR)? Ability to recover from a disaster like fire, theft, physical destruction, large-scale events, etc. The process of planning, preparing, rehearsing, testing, documenting, training, and updating the process itself Goal: minimize business impact after disaster Part of Business Continuity Planning (BCP)
8. DR Objectives – Common Terms RTO (Recovery Time Objective): duration of time and service level within which a business process needs to be restored after a disaster in order to avoid unacceptable consequences. Example: 4 hours. RPO (Recovery Point Objective): acceptable amount of data loss measured in time. Example: 2 minutes.
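The two objectives can be read as simple time comparisons. The sketch below is only an illustration: it reuses the example values from the slide (4 hours, 2 minutes), and the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Example objectives taken from the slide; real values come from the business.
RTO = timedelta(hours=4)
RPO = timedelta(minutes=2)


def meets_rpo(last_recovery_point: datetime, disaster_at: datetime,
              rpo: timedelta = RPO) -> bool:
    """Data written after the last recovery point is lost, so that
    window has to fit inside the RPO."""
    return disaster_at - last_recovery_point <= rpo


def meets_rto(disaster_at: datetime, restored_at: datetime,
              rto: timedelta = RTO) -> bool:
    """The business process has to be back within the RTO."""
    return restored_at - disaster_at <= rto


if __name__ == "__main__":
    disaster = datetime(2011, 6, 1, 12, 0, 0)  # hypothetical timestamp
    print(meets_rpo(disaster - timedelta(seconds=90), disaster))  # True
    print(meets_rto(disaster, disaster + timedelta(hours=5)))     # False
```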
9. DR Planning Business guides RTO/RPO Based on financial impact Based on continuity impact etc. IT seeks cost effective solutions to RTO and RPO Tradeoff: Cost vs. RTO/RPO
10. DR with AWS: Advantages Infrastructure available when you need it Multiple locations world wide Various building blocks and services available Fine control over cost vs. RTO/RPO Ability to scale up when needed; automatable No headache of provisioning physical infrastructure Ability to effectively exercise your DR plan Pay only for what you use Several options available that don’t require provisioning of duplicate infrastructure
12. AWS Features for DR Amazon Simple Storage Service (S3) Amazon Import/Export Amazon Elastic Compute Cloud (EC2) Amazon Machine Images (AMI) Reserved Instances Elastic IP Addresses VM Import Amazon Elastic Block Store (EBS) and Snapshots Amazon CloudWatch
13. AWS Features for DR Multiple Regions and Availability Zones Amazon Route 53 Amazon Virtual Private Cloud (VPC) AWS CloudFormation Amazon CloudWatch APIs and various SDKs for automation
15. Architectural Patterns Overview Variety of approaches exist Tradeoff between RTO/RPO vs. cost and complexity Example Architectural Patterns (sorted by increasingly optimal RTO/RPO) Backup and Restore “Pilot Light” for Quick Recovery Fully Working Low Capacity Standby Multi-Site Hot Standby Virtual Workstations Best Practices for Being Prepared
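One way to picture the tradeoff between RTO/RPO and cost is to pick the cheapest pattern that still meets the objectives. The sketch below assumes made-up RTO, RPO, and cost figures for each pattern; they are placeholders for illustration only and do not come from the presentation or from AWS pricing.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Pattern:
    name: str
    rto: timedelta
    rpo: timedelta
    monthly_cost: float  # hypothetical, arbitrary currency units


# Placeholder numbers, ordered as on the slide (increasingly better RTO/RPO).
PATTERNS: List[Pattern] = [
    Pattern("Backup and Restore", timedelta(hours=24), timedelta(hours=24), 100.0),
    Pattern("Pilot Light", timedelta(hours=1), timedelta(minutes=10), 400.0),
    Pattern("Fully Working Low Capacity Standby", timedelta(minutes=15), timedelta(minutes=5), 1500.0),
    Pattern("Multi-Site Hot Standby", timedelta(minutes=1), timedelta(minutes=1), 6000.0),
]


def cheapest_pattern(rto: timedelta, rpo: timedelta,
                     patterns: List[Pattern] = PATTERNS) -> Optional[Pattern]:
    """Return the lowest-cost pattern whose RTO and RPO both meet the
    objectives, or None if no pattern qualifies."""
    candidates = [p for p in patterns if p.rto <= rto and p.rpo <= rpo]
    return min(candidates, key=lambda p: p.monthly_cost) if candidates else None


if __name__ == "__main__":
    choice = cheapest_pattern(rto=timedelta(hours=4), rpo=timedelta(minutes=30))
    print(choice.name if choice else "no pattern meets the objectives")
```

With real estimates in place of the placeholders, tightening either objective moves the choice further down the list and raises the cost.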
16. Backup and Restore Advantages Simple to get started Extremely cost effective (mostly backup storage) Preparation Phase Take backups of current systems Store backups in S3 Describe procedure to restore from backup on AWS Know which AMI to use, build your own as needed Know how to restore system from backups Know how to switch to new system Know how to configure the deployment FREE Inbound Data Transfer starting July 1st, 2011
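The "store backups in S3" step might look roughly like the sketch below. It assumes the caller passes in an S3 client object that exposes `upload_file(Filename, Bucket, Key)`, as the boto3 S3 client does; the bucket name, host name, and key layout are hypothetical choices, not something the presentation prescribes.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def backup_key(hostname: str, path: Path, taken_at: datetime) -> str:
    """Build a dated object key so backups sort by time (hypothetical layout)."""
    return f"backups/{hostname}/{taken_at:%Y/%m/%d/%H%M%S}/{path.name}"


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large backups are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(s3_client, bucket: str, hostname: str, path: Path,
                taken_at: Optional[datetime] = None) -> dict:
    """Upload one backup file and return a record to keep with the restore procedure.

    s3_client is assumed to behave like boto3.client("s3").
    """
    taken_at = taken_at or datetime.now(timezone.utc)
    key = backup_key(hostname, path, taken_at)
    s3_client.upload_file(str(path), bucket, key)
    return {"bucket": bucket, "key": key, "sha256": sha256_of(path)}
```

Keeping the checksum next to the object key gives the restore procedure something to verify against later.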
17. Backup to S3 [Diagram labels: www.example.com; Amazon Route 53; Customer Infrastructure; Traditional server; Data copied to S3; Bucket with Objects; AWS Import/Export]
18. Backup and Restore In Case of Disaster Retrieve backups from S3 Bring up required infrastructure EC2 instances with prepared AMIs, Load Balancing, etc. Restore system from backup Switch over to the new system Adjust DNS records to point to AWS Objectives RTO: as long as it takes to bring up infrastructure and restore system from backups RPO: time since last backup
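For backup and restore, the objectives stated above reduce to simple arithmetic: the RTO is roughly the sum of the sequential recovery steps, and the worst-case RPO is the interval between backups. The sketch below assumes the steps run strictly one after another and that every scheduled backup succeeds; the step durations are hypothetical.

```python
from datetime import timedelta
from typing import Mapping


def estimated_rto(step_durations: Mapping[str, timedelta]) -> timedelta:
    """Sum the recovery steps, assuming they run sequentially."""
    return sum(step_durations.values(), timedelta())


def worst_case_rpo(backup_interval: timedelta) -> timedelta:
    """A disaster just before the next backup loses one full interval of data."""
    return backup_interval


if __name__ == "__main__":
    steps = {  # hypothetical durations measured during a DR exercise
        "retrieve backups from S3": timedelta(minutes=45),
        "bring up infrastructure": timedelta(minutes=20),
        "restore system from backup": timedelta(minutes=90),
        "switch DNS to AWS": timedelta(minutes=10),
    }
    print(estimated_rto(steps))                 # 2:45:00
    print(worst_case_rpo(timedelta(hours=24)))  # 1 day, 0:00:00
```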
19. Restore from S3 into AWS [Diagram labels: www.example.com; Amazon Route 53; Availability Zone; Amazon Elastic Compute Cloud (EC2); AMI; EC2 quickly provisioned from AMI, pre-bundled with OS and applications; Bucket with Objects; Data copied from objects in S3]
20. “Pilot Light” for Quick Recovery Advantages Reduced RTO and RPO Very cost effective (very few 24/7 resources) Preparation Phase Enable replication of all critical data to AWS Standby DB, replica, mirror, etc. Reduced infrastructure that runs 24/7 in AWS Prepare all required resources for automatic start AMIs, Network Settings, Load Balancing, etc. Only runs when used for DR Reserved Instances
21. “Pilot Light” in Non-DR Phase [Diagram labels: www.example.com; Reverse Proxy / Caching Server (x2); Application Server (x2); Database Server (x2); DataVolume (x2); Not Running; Smaller Instance; Data Mirroring / Replication]
22. “Pilot Light” for Quick Recovery In Case of Disaster Automatically bring up resources around the replicated core data set Scale the system as needed to handle current production traffic Switch over to the new system Adjust DNS records to point to AWS Objectives RTO: as long as it takes to detect need for DR and automatically scale up replacement system RPO: depends on replication type
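The disaster-phase steps above are an ordered runbook: bring up resources, scale, then switch over. A minimal sketch of automating that order is shown below. The step names and the `FailoverError` type are hypothetical, and the actions are plain callables standing in for whatever actually starts instances or updates DNS.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


class FailoverError(RuntimeError):
    """Raised when a step fails; records how far the runbook got."""

    def __init__(self, failed_step: str, completed: List[str]):
        super().__init__(f"failover stopped at step: {failed_step}")
        self.failed_step = failed_step
        self.completed = completed


def run_failover(steps: List[Step]) -> List[str]:
    """Run the steps in order and stop at the first failure, so DNS is
    never switched to a system that did not come up."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            raise FailoverError(name, completed) from exc
        completed.append(name)
    return completed


if __name__ == "__main__":
    log: List[str] = []
    runbook: List[Step] = [  # hypothetical placeholders for real actions
        ("start servers around replicated data", lambda: log.append("start")),
        ("scale to production traffic", lambda: log.append("scale")),
        ("point DNS records at AWS", lambda: log.append("dns")),
    ]
    print(run_failover(runbook))
```

Putting the DNS switch last means a failed start or scale step leaves traffic where it was.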
23. “Pilot Light” in Disaster Phase [Diagram labels: www.example.com; Reverse Proxy / Caching Server (x2); Application Server (x2); Database Server (x2); DataVolume (x2); Not Running; Smaller Instance]
24. “Pilot Light” in Recovered Phase [Diagram labels: www.example.com; Reverse Proxy / Caching Server (x2); Application Server (x2); Database Server (x2); DataVolume (x2); Start in Minutes; Resize Instance to Prod Capacity]
25. Fully Working Low Capacity Standby Advantages Can take some production traffic at any time Cost savings (IT footprint smaller than full DR) Preparation Similar to “Pilot Light” All necessary components running 24/7, but not scaled for production traffic
26. Low Capacity Standby in Non-DR Phase [Diagram labels: www.example.com; Amazon Route 53; Active; Not Active for Production Traffic; On site; Scaled down Standby; Elastic Load Balancer; Reverse Proxy / Caching Server (x2); Application Server (x2); Master Database Server; Slave Database Server; Application Data Source Cut Over; DataVolume (x2); Mirroring / Replication]
27. Fully Working Low Capacity Standby In Case of Disaster Immediately fail over most critical production load Adjust DNS records to point to AWS Scale the system further to handle all production load Objectives RTO: for critical load: as long as it takes to fail over; for all other load, as long as it takes to scale further RPO: depends on replication type
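The two-stage recovery described above (fail over the critical load first, then scale for everything else) comes down to a capacity calculation. The sketch below assumes load can be expressed in requests per second and that each instance handles a fixed rate; both the model and the numbers are hypothetical simplifications.

```python
import math


def instances_needed(load_rps: float, capacity_rps_per_instance: float) -> int:
    """Smallest instance count that can carry the load (at least one)."""
    if capacity_rps_per_instance <= 0:
        raise ValueError("capacity per instance must be positive")
    return max(1, math.ceil(load_rps / capacity_rps_per_instance))


def scale_up_plan(standby_instances: int, critical_rps: float,
                  full_rps: float, capacity_rps_per_instance: float) -> dict:
    """Check whether the standby can take the critical load right away and
    how many instances must be added to take all production load."""
    for_critical = instances_needed(critical_rps, capacity_rps_per_instance)
    for_full = instances_needed(full_rps, capacity_rps_per_instance)
    return {
        "covers_critical_load_now": standby_instances >= for_critical,
        "instances_to_add": max(0, for_full - standby_instances),
    }


if __name__ == "__main__":
    # Hypothetical: 2 standby instances, 200 requests/s each.
    print(scale_up_plan(standby_instances=2, critical_rps=300,
                        full_rps=900, capacity_rps_per_instance=200))
```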
28. Standby Scaled Up in DR Phase [Diagram labels: www.example.com; Amazon Route 53; Active (x2); Elastic Load Balancer; Scaled up for Production Load; Reverse Proxy / Caching Server (x2); Application Server (x2); Database Server; Master Database Server; DataVolume (x2)]
29. Multi-Site Hot Standby Advantages At any moment can take all production load Preparation Similar to Low Capacity Standby Fully scaling in/out with production load In Case of Disaster Immediately fail over all production load Adjust DNS records to point to AWS Objectives RTO: as long as it takes to fail over RPO: depends on replication type
30. Multi-Site Hot Standby in Non-DR Phase [Diagram labels: www.example.com; Amazon Route 53; Active (x2); On site; Elastic Load Balancer; Reverse Proxy / Caching Server (x2); Application Server (x2); Master Database Server; Slave Database Server; Application Data Source Cut Over; DataVolume (x2); Mirroring / Replication]
31. Multi-Site Hot Standby in DR Phase [Diagram labels: www.example.com; Amazon Route 53; Active (x2); Elastic Load Balancer; Reverse Proxy / Caching Server (x2); Application Server (x2); Database Server; Master Database Server; DataVolume (x2)]
32. Multi AZ HA Deployment [Diagram labels: www.example.com; Amazon Route 53; Health Check: keeps working systems in service; Availability Zone A; Availability Zone B; Reverse Proxy / Caching Server (x2); Application Server (x2); Master Database Server; Slave Database Server; Application Data Source Cut Over; DataVolume (x2); Mirroring / Replication]
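The health check in this deployment keeps only working systems in service. A minimal sketch of that bookkeeping follows. The `HealthTracker` class, the failure threshold of two consecutive checks, and the target names are hypothetical, and this is not how any particular AWS service implements health checks.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


class HealthTracker:
    """Take a target out of service after N consecutive failed checks
    and put it back as soon as one check succeeds."""

    def __init__(self, unhealthy_threshold: int = 2):
        self.unhealthy_threshold = unhealthy_threshold
        self._failures: Dict[str, int] = defaultdict(int)

    def record(self, target: str, ok: bool) -> None:
        # A success resets the counter; a failure increments it.
        self._failures[target] = 0 if ok else self._failures[target] + 1

    def in_service(self, targets: Iterable[str]) -> List[str]:
        return [t for t in targets
                if self._failures[t] < self.unhealthy_threshold]


if __name__ == "__main__":
    targets = ["app-az-a", "app-az-b"]  # hypothetical names
    tracker = HealthTracker()
    tracker.record("app-az-a", ok=False)
    tracker.record("app-az-a", ok=False)
    print(tracker.in_service(targets))  # ['app-az-b']
```

Requiring more than one failed check avoids dropping a server because of a single slow response, at the cost of reacting slightly later.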
33. Hosted Workstations Advantages Replacement of workstations in case of disaster Pay only when used for DR Preparation Set up AMIs with appropriate working environment In Case of Disaster Launch desktop AMI and resume work Objectives RTO: as long as it takes to launch AMI and restore work environment on virtual desktop RPO: depends on state of AMI
34. Best Practices for Being Prepared Start simple and work your way up Backups in AWS as a first step Improve RTO/RPO as a continuous effort Exercise your DR Solution Game Day Ensure backups, snapshots, AMIs, etc. are working Monitor your monitoring system Check into Licensing
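Ensuring that backups are working can be partly automated during a DR exercise by comparing a restored file with a checksum recorded when the backup was taken. The sketch below assumes such a SHA-256 checksum was stored at backup time; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large restores are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(restored: Path, expected_sha256: str) -> bool:
    """True only if the restored file exists and matches the recorded checksum."""
    return restored.is_file() and sha256_of(restored) == expected_sha256.lower()
```

A matching checksum shows the bytes survived the round trip. It does not show that the application starts from them, which is what the Game Day exercise is for.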
38. Conclusion – Advantages of DR with AWS Various building blocks available Fine control over cost vs. RTO/RPO Ability to scale up when needed Pay only for what you use and/or in case of DR Ability to effectively exercise DR plan Availability of multiple locations world wide Hosted workstations possible Variety of Solutions Providers
39. Thank You!…and special thanks to Ianni Vamvadelis and Glen Robinson for their help preparing this presentation!
40. Question & Answer Visit http://aws.amazon.com/officehours to watch recorded sessions and to sign up for upcoming sessions.