VMWare Forum Winnipeg - 2012
1. Disaster Recovery and Failover using SRM
Russ Pedneault, Technology Services Manager
Anil C. Sedha, Midrange Services Supervisor
Kevin Seniuk, Senior Technical Specialist
VMWare Forum Winnipeg
May 15, 2012
2. Company Overview
Largest publisher by circulation of paid English-language daily newspapers in
Canada, representing some of the country’s oldest and best known media brands.
Reaching millions of Canadians every week
Engage readers and offer advertisers and marketers integrated solutions to
effectively reach target audiences through a variety of print, online, digital, and
mobile platforms.
Postmedia Network is a Mobile Web Leader – 120 Daily News media mobile sites,
80+ vertical mobile web sites, 1M monthly visitors, 9M monthly page views.
3. IT Overview
Virtualization Platform: VMWare vSphere 4.1 and 5.0, SRM v4.1
500+ Virtual Servers, 250 Physical servers, 3 Virtual Center servers, 4000+
desktops, 3 datacenters and 13 smaller sites
Server Hardware: HP, Cisco, Sun, and Alpha/VMS servers
Storage: EMC CLARiiON and VNX arrays, HP EVA arrays, Sun Storage, Data Domain VTL
Operating Systems: VMWare ESXi, Windows 2003/2008, HP-UX, VMS, Red Hat
Enterprise Linux, Solaris, SUSE Linux, Apple
Messaging: Exchange 2007, MS Office Communicator, Cisco Unified Messaging
Database: Oracle, MS SQL, Sybase, MySQL
4. Virtualization/SRM Story
Background
IT could not recover data quickly enough: Postmedia's recovery plans were time-consuming and
involved special recovery procedures requiring expert knowledge.
Challenges
IT environment was running mostly on old physical servers and had clustering/mirroring in place
The Inevitable Happens
- An entire datacenter goes down due to a power outage despite power protection.
- After power was restored another outage had to be taken to perform repairs.
- Enhanced recovery procedures were not in place at that time
Resolution
- Deploy a virtualization-first strategy
- Implement SRM with existing Storage Replication Technology
- Upgrade SRM to run with newer Storage Replication Technology
Turnaround
SRM failover brings relief and new confidence within the organization that data can be recovered
in a very short time, with rollback capabilities.
5. Background
Key Issues –
- Recovery timeline was unacceptable for some revenue generating applications
- Multiple resources from Application and Infrastructure teams had to be involved
- Operational sequence for recovery was manual so mistakes could easily happen
- Changes in application environments meant keeping up with those changes manually
- Managing failover/recovery of remote sites
6. Challenges
Physical server infrastructure does not offer the flexibility for easy failover to a
secondary site.
Reliance on aging hardware – unsure if server would come up after restart
Many manual steps needed to make remote site operational
Required specialists to bring up Storage environment at remote site before Server
environment could be brought up.
Clustered Environments presented additional challenges – Microsoft Cluster,
HP-UX Cluster, Sun Cluster.
Pushback from Application teams – "don't touch the server running our applications"
7. Challenges
A large number of application servers were running on physical
hardware.
A great deal of effort was needed by both Application and
Infrastructure teams.
Outages of critical applications lasting longer than the expected
timeframe would mean revenue loss.
IT had never done a datacenter recovery or failover in the past.
8. Reality Bites (Power Outage)
There was an unexpected Power Outage at one of our Datacenters and all
servers went offline for approximately an hour.
Server recovery after the power outage took further effort and several
hours.
The initial event left Postmedia IT wondering what to do since a recovery
would have taken many hours.
Once power was restored, the Service Provider needed a planned outage of
around 8 hours to repair the power infrastructure, which required a planned failover.
After negotiation, Postmedia was given 5 days (the repair was originally
scheduled for the next day) to perform the planned failover before the outage.
9. What SRM did for us
Created a complete recovery process in a simple, centralized recovery plan,
and automated recovery steps.
SRM allowed failover of the Exchange 2007 environment in minutes.
Other application servers failed over in minutes as well.
Half of the datacenter move was accomplished quickly and within the
expected timeframe.
The success of SRM and Virtualization gave the impetus to create further cost
savings by virtualizing and retiring older servers.
10. What SRM did for us (Contd)
Postmedia IT chose the approach of showcasing the benefits of
virtualization instead of forcing virtualization on the business.
Highlighted the capabilities of SRM by failing over the Exchange 2007
environment in minutes.
Recovery is greatly simplified: even a non-IT individual within the
organization who has authorization and knows the documented
login procedures can press the recovery button in case of a disaster.
11. Lessons Learned
SRM recovery plans should be created based on which
application consistency groups need to be failed over together.
Review your common outage windows for each application.
Ensure you have efficient storage replication mechanisms in place
that integrate with SRM.
Verify your Recovery Plans in advance by running a test (this does
not perform an actual failover).
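The first lesson above, building recovery plans around the consistency groups that must fail over together, can be illustrated with a minimal sketch. It is not SRM configuration and does not use any SRM API: the VM names, the group labels, and the function `build_recovery_plans` are all hypothetical and only show the grouping idea.

```python
from collections import defaultdict

# Hypothetical inventory of (vm_name, consistency_group) pairs.
# None of these names come from the Postmedia environment.
INVENTORY = [
    ("exch-mbx-01", "exchange"),
    ("exch-cas-01", "exchange"),
    ("web-01", "web"),
    ("web-db-01", "web"),
    ("ctx-01", "citrix"),
]


def build_recovery_plans(inventory):
    """Group VMs so that each consistency group becomes one recovery plan."""
    plans = defaultdict(list)
    for vm_name, group in inventory:
        plans[group].append(vm_name)
    # Sort the VM names so each plan's contents are listed deterministically.
    return {group: sorted(vms) for group, vms in plans.items()}


if __name__ == "__main__":
    for group, vms in build_recovery_plans(INVENTORY).items():
        print(f"recovery plan '{group}': {', '.join(vms)}")
```

Keeping one plan per consistency group means the servers that share replicated storage are always failed over as a unit.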
12. Planned Failover - Now
With newer replication mechanisms available in the industry it is
easier and quicker to perform failover using SRM.
Postmedia moved away from traditional software-based replication to
hardware appliance-based replication.
We now have PVR-like capabilities to roll back data to any point in time –
right down to the second
Our recent array upgrade required planned failovers and we were able to
fail over Exchange and other critical applications in 7-13 minutes per
recovery group.
Tested before we failed over to ensure success
Ran 3 recovery plans simultaneously for faster failover
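The last two points, testing before failing over and running three recovery plans at once, can be sketched as a simple pattern. This is an illustration only: `rehearse_plan` and `fail_over` are hypothetical stand-ins rather than SRM calls, and the plan names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def rehearse_plan(plan_name):
    """Hypothetical stand-in for a non-disruptive recovery plan test."""
    return True  # a real check would report whether the rehearsal succeeded


def fail_over(plan_name):
    """Hypothetical stand-in for the actual failover of one recovery plan."""
    return f"{plan_name}: failed over"


def run_failover(plan_names, max_parallel=3):
    """Rehearse every plan first, then fail the plans over in parallel."""
    failed = [name for name in plan_names if not rehearse_plan(name)]
    if failed:
        # Stop before touching production if any rehearsal did not pass.
        raise RuntimeError(f"rehearsal failed for: {', '.join(failed)}")
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() returns results in the same order as plan_names.
        return list(pool.map(fail_over, plan_names))


if __name__ == "__main__":
    for line in run_failover(["exchange", "citrix", "cms"]):
        print(line)
```

The rehearsal gate runs to completion before any failover starts, so a failed test blocks all plans rather than leaving the site half moved.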
13. Where we are today
450+ virtual servers, 50+ ESXi hosts
SRM 4.1 fully implemented for all virtualized production servers
Replication mechanism fully integrated and automated with SRM – wide variety of
storage related replication products
Recovery of critical applications like Exchange, Citrix, and CMS takes 7-13 minutes to
bring servers up at the secondary site
Settled on RecoverPoint appliances to perform replication since they offer PVR-like
data rollback capabilities.
The organization has adopted a “Virtualize First” strategy.
Significant ability to meet business timelines for application recovery.
Can recover an entire datacenter quickly and successfully.
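The PVR-like rollback mentioned above amounts to choosing the latest recorded point in time at or before the moment you want to return to. The sketch below shows only that selection logic under stated assumptions: the journal is modelled as a sorted list of timestamps, the dates are invented, and `pick_rollback_point` is a hypothetical helper, not a RecoverPoint interface.

```python
from bisect import bisect_right
from datetime import datetime


def pick_rollback_point(journal, target):
    """Return the latest journal timestamp at or before target, or None.

    journal must be sorted in ascending order.
    """
    position = bisect_right(journal, target)
    return journal[position - 1] if position else None


if __name__ == "__main__":
    # Hypothetical journal entries recorded five seconds apart.
    journal = [
        datetime(2012, 5, 15, 10, 0, 0),
        datetime(2012, 5, 15, 10, 0, 5),
        datetime(2012, 5, 15, 10, 0, 10),
    ]
    wanted = datetime(2012, 5, 15, 10, 0, 7)
    print(pick_rollback_point(journal, wanted))
```

A binary search keeps the lookup fast even when the journal holds a very large number of points.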
Recovery was difficult since disparate systems were in use and each one required its own recovery procedure.
Recovery Timeline – Some critical applications like Canada.com have multiple consistency groups and tens of servers, so it was difficult to let them stay down for long. This environment powers all of our major websites.
Planned failover timeline – When we performed a failover, server recovery depended on how quickly the SAN team could fail over the mirror volumes.
Multiple resources involved – Since only infrastructure was under our control, the choreography of which servers come up first and which interface servers had to be brought up next required a lot of intervention.
Operational sequence – Even if we planned for servers to come up in a specific sequence, there was always a chance that mistakes could happen.
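The choreography problem described in these notes, deciding which servers must come up before others, is a dependency-ordering problem. As a rough illustration (not how SRM itself orders startup), Python's standard `graphlib` module can derive a valid sequence from a dependency map. The server roles in `DEPENDS_ON` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each server lists what must be up before it.
DEPENDS_ON = {
    "database": set(),
    "app-server": {"database"},
    "interface-server": {"app-server"},
    "web-frontend": {"app-server"},
}


def startup_order(depends_on):
    """Return one startup sequence that respects every dependency.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, server in enumerate(startup_order(DEPENDS_ON), start=1):
        print(f"{step}. start {server}")
```

Writing the dependencies down once and deriving the order removes the manual sequencing step where mistakes could happen.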
We were able to showcase the value of using SRM to perform further virtualization since recovery was simplified.