VMWare Forum Winnipeg - 2012

Disaster Recovery and Failover using SRM

Russ Pedneault, Technology Services Manager
Anil C. Sedha, Midrange Services Supervisor
Kevin Seniuk, Senior Technical Specialist

VMWare Forum Winnipeg
May 15, 2012

Company Overview
- Largest publisher by circulation of paid English-language daily newspapers in
Canada, representing some of the country’s oldest and best-known media brands.

- Reaches millions of Canadians every week

- Engages readers and offers advertisers and marketers integrated solutions to
effectively reach target audiences through a variety of print, online, digital, and
mobile platforms.

- Postmedia Network is a Mobile Web Leader – 120 Daily News media mobile sites,
80+ vertical mobile web sites, 1M monthly visitors, 9M monthly page views.


IT Overview
- Virtualization Platform: VMWare vSphere 4.1 and 5.0, SRM v4.1

- 500+ Virtual Servers, 250 Physical servers, 3 Virtual Center servers, 4000+
desktops, 3 datacenters and 13 smaller sites

- Server Hardware: HP, Cisco, Sun, and Alpha/VMS servers

- Storage: EMC Clariion and VNX arrays, HP EVA arrays, Sun Storage, Data Domain VTL

- Operating System: VMWare ESXi, Windows 2003/2008, HP-UX, VMS, Red Hat
Enterprise Linux, Solaris, SUSE Linux, Apple

- Messaging: Exchange 2007, MS Office Communicator, Cisco Unified Messaging

- Database: Oracle, MS SQL, Sybase, MySQL


Virtualization/SRM Story
Background
Postmedia's recovery plans were time-consuming and relied on special recovery procedures
requiring expert knowledge, so IT could not recover data quickly enough.

Challenges
IT environment was running mostly on old physical servers and had clustering/mirroring in place

The Inevitable Happens
- An entire datacenter went down due to a power outage despite power protection.
- After power was restored, another outage had to be taken to perform repairs.
- Enhanced recovery procedures were not in place at that time.

Resolution
- Deploy a virtualization-first strategy
- Implement SRM with existing Storage Replication Technology
- Upgrade SRM to run with newer Storage Replication Technology

Turnaround
SRM failover brought relief and new confidence within the organization that data can be recovered
in a very short time, with rollback capabilities.


Background
Key Issues –

- Recovery timeline was unacceptable for some revenue generating applications

- Multiple resources from Application and Infrastructure teams had to be involved

- Operational sequence for recovery was manual so mistakes could easily happen

- Changes in application environments meant keeping up with those changes manually

- Managing failover/recovery of remote sites


Challenges

- Physical server infrastructure does not offer the flexibility for easy failover to
a secondary site.

- Reliance on aging hardware – unsure if a server would come up after restart

- Many manual steps needed to make the remote site operational

- Required specialists to bring up the Storage environment at the remote site before the Server
environment could be brought up.

- Clustered Environments presented additional challenges – Microsoft Cluster,
HP-UX Cluster, Sun Cluster.

- Push back from Application teams – "don’t touch the server running our applications"


Challenges (Contd)
- A large number of application servers were running on physical
hardware.

- A great deal of effort was needed by both Application and
Infrastructure teams.

- Outages to critical applications for longer than the expected
timeframe would mean revenue loss.

- IT had never done a datacenter recovery or failover in the past.


Reality Bites (Power Outage)

There was an unexpected Power Outage at one of our Datacenters and all
servers went offline for approximately an hour.

Server Recovery after power outage took further effort and quite a few
hours.

The initial event left Postmedia IT wondering what to do since a recovery
would have taken many hours.

Once power was restored, the Service Provider needed a further outage of around
8 hours to repair the power infrastructure, which required a planned failover.

After negotiation, Postmedia was given 5 days (the outage was originally scheduled
for the next day) to perform the planned failover before the outage.


What SRM did for us
- Created a complete recovery process in a simple, centralized recovery plan,
and automated recovery steps.

- SRM allowed failover of the Exchange 2007 environment in minutes.

- Other application servers failed over in minutes as well.

- Half of the datacenter move was accomplished quickly and within the
expected timeframe.

- The success of SRM and Virtualization gave the impetus to create further cost
savings by virtualizing and retiring older servers.


What SRM did for us (Contd)
- Postmedia IT chose the approach of showcasing the benefits of
virtualization instead of forcing virtualization on the business.

- Highlighted the capabilities of SRM failover of the Exchange 2007
environment in minutes.

- Recovery is greatly simplified: even a non-IT individual within the
organization who has authorization and knows the documented
login procedures can press the recovery button in case of a disaster.


Lessons Learned

- SRM recovery plans should be created based on which
application consistency groups need to be failed over together.

- Review your common outage windows based on applications.

- Ensure you have efficient storage replication mechanisms in place
that integrate with SRM.

- Verify your Recovery Plans in advance by running a test (this does
not perform an actual failover).
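
The first lesson above can be illustrated with a small sketch. It assumes a hypothetical inventory of (VM name, consistency group) pairs and builds one recovery plan per group, then checks that no group has been split across plans. The VM names, the `rp-` prefix, and both helper functions are illustrative assumptions; none of this is the SRM API.

```python
from collections import defaultdict

# Hypothetical inventory: (vm_name, consistency_group). Names are made up.
INVENTORY = [
    ("exch-mbx-01", "exchange"),
    ("exch-cas-01", "exchange"),
    ("ctx-app-01", "citrix"),
    ("cms-web-01", "cms"),
    ("cms-db-01", "cms"),
]


def build_recovery_plans(inventory):
    """Return {plan_name: [vm, ...]} with one plan per consistency group."""
    plans = defaultdict(list)
    for vm, group in inventory:
        plans[f"rp-{group}"].append(vm)
    return {name: sorted(vms) for name, vms in sorted(plans.items())}


def split_groups(plans, inventory):
    """Return the consistency groups whose VMs sit in more than one plan."""
    vm_to_plan = {vm: name for name, vms in plans.items() for vm in vms}
    plans_per_group = defaultdict(set)
    for vm, group in inventory:
        plans_per_group[group].add(vm_to_plan.get(vm))
    return sorted(g for g, names in plans_per_group.items() if len(names) > 1)


if __name__ == "__main__":
    plans = build_recovery_plans(INVENTORY)
    for name, vms in plans.items():
        print(name, vms)
    print("split groups:", split_groups(plans, INVENTORY))
```

A check like `split_groups` is the kind of thing that could run before a test recovery, so that a group whose servers depend on each other is never failed over in pieces.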


Planned Failover - Now
- With newer replication mechanisms available in the industry, it is
easier and quicker to perform failover using SRM.

- Postmedia moved away from traditional software-based replication to
hardware appliance-based replication.

- We now have PVR-like capabilities to roll back data to any point in time –
right down to the second

- Our recent array upgrade required planned failovers and we were able to
fail over Exchange and other critical applications in 7-13 minutes per
recovery group.

- Tested before we failed over to ensure success

- Ran 3 recovery plans simultaneously for faster failover
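
The benefit of running recovery plans simultaneously can be shown with simple arithmetic. The sketch below is a rough model only: it assumes plans run in batches, that each batch takes as long as its slowest plan, and that plans do not slow each other down. The durations 7, 10, and 13 minutes are assumed values picked from the 7-13 minute range quoted above, not measurements.

```python
def failover_wall_clock(plan_minutes, max_concurrent):
    """Estimate total minutes when plans run in batches of max_concurrent.

    Simplified model: each batch lasts as long as its slowest plan, and
    concurrent plans are assumed not to contend for resources.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    total = 0
    for start in range(0, len(plan_minutes), max_concurrent):
        batch = plan_minutes[start:start + max_concurrent]
        total += max(batch)
    return total


if __name__ == "__main__":
    assumed = [7, 10, 13]  # assumed per-plan durations in minutes
    print("one at a time:", failover_wall_clock(assumed, 1), "minutes")
    print("three at once:", failover_wall_clock(assumed, 3), "minutes")
```

Under these assumptions, three plans run one at a time take 30 minutes, while the same three run together finish in 13 minutes, the duration of the slowest plan.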


Where we are today
450+ virtual servers, 50+ ESXi hosts

SRM 4.1 fully implemented for all virtualized production servers

Replication mechanism fully integrated and automated with SRM – wide variety of
storage-related replication products

Recovery of critical applications like Exchange, Citrix, and CMS takes 7-13 minutes to
bring servers up at the secondary site

Settled on RecoverPoint appliances to perform Replication since they offer PVR-like
data rollback capabilities.

The organization has adopted a “Virtualize First” strategy.

Significant ability to meet business timelines for application recovery.

Can recover an entire datacenter quickly and successfully.
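
The PVR-like rollback mentioned above comes down to choosing the latest recovery point at or before a chosen moment. A minimal sketch, assuming the recovery points are available as a plain list of timestamps; `pick_recovery_point` is a hypothetical helper for illustration and is not a RecoverPoint or SRM call.

```python
from bisect import bisect_right
from datetime import datetime


def pick_recovery_point(points, target):
    """Return the latest recovery point at or before target, or None."""
    ordered = sorted(points)
    idx = bisect_right(ordered, target)  # count of points <= target
    return ordered[idx - 1] if idx else None


if __name__ == "__main__":
    # Assumed recovery points a few seconds apart (illustrative values).
    points = [
        datetime(2012, 5, 15, 9, 0, 0),
        datetime(2012, 5, 15, 9, 0, 5),
        datetime(2012, 5, 15, 9, 0, 10),
    ]
    print(pick_recovery_point(points, datetime(2012, 5, 15, 9, 0, 7)))
```

Returning `None` when no point precedes the target makes the "nothing to roll back to" case explicit instead of silently picking a later image.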

