16. Sample #5: Local Replication with BCVs – Replication Age Inconsistency
Result: Data corruption
17. Sample #6: Configuration drifts between Production and DR
Result: Increased time to recover
Production – Hardware: 8 x CPU 2.2GHz, 32 GB RAM, 2 x HBA, 2 x NIC. Software: OS HP-UX 11.31, WebSphere, Java 1.5, EMC PowerPath 4.4. Kernel parameters: max user processes 8192, max # of semaphores 600.
DR – Hardware: 2 x CPU 2.2GHz, 8 GB RAM, 1 x HBA, 1 x NIC. Software: OS HP-UX 11.23, NO WebSphere, Java 1.4.2, EMC PowerPath 3.0.5. Kernel parameters: max user processes 1024, max # of semaphores 128.
More differences in the areas of DNS, NTP, page files, Internet services, patches, etc.
18. Sample #7: Configuration drifts between Production and HA
Result: Downtime, manual intervention needed to recover
Hardware: 2 x HBA. Software: Microsoft .NET 2.0 SP2, Windows x64 SP1, Oracle MTS Recovery Service. DNS configuration: 192.168.68.50, 192.168.68.51, 192.168.2.50. Page files: 1 x 1 GB (C:), 1 x 4 GB (D:). Kernel parameters: number of open files 32767.
19. Sample #8: SAN I/O path – single point of failure
Result: Reduced MTBF, downtime, sub-optimal performance
20. Sample #9: Replica create time inconsistency
Result: File system not usable at the DR site
24. How it works
- Windows 2003 Server with an Oracle 10g schema
- SYMCLI/NaviCLI “proxy” for EMC Symmetrix / CLARiiON storage arrays
- StorageScope API for EMC ECC
- HiCommand API for HDS
- SSH / Telnet for NetApp filers
- SSH / WMI using valid user credentials (hosts)
- JDBC using valid user credentials (databases, including DB2)
- IE6+ web client, Java 1.5+
Continuity offers a product that can find all those problems as they happen – instead of waiting a full year (which means no protection during that year).
The RecoverGuard dashboard provides concise and valuable information regarding your DR coverage and status – at a glance.

The top-left pane summarizes the last scan's coverage, identifying the hosts, databases, storage arrays and business services (or processes) scanned. It also points out which areas could not be reached, letting the user decide on the appropriate action. Clicking the pane reveals a detailed scan report, including scan history and statistics.

The middle-left pane provides a “snapshot” of the current business service protection state – identifying risks to data and system availability, as well as optimization opportunities. Clicking on any business service reveals a more detailed information view and allows easy navigation into specific gap tickets.

The bottom tabbed view displays the top 5 currently-open tickets, as well as the top 5 recently detected ones. Clicking each ticket opens the appropriate ticket details view (see examples in the next slides). Notice that each ticket is ranked by its threat level. The threat level computation weighs, among other considerations:
- The importance of the involved business service
- The role of the resource identified by the ticket (for example, is it production data? A replica used for DR? A replica used for QA? The risk is obviously different in each case)
- The technical severity of the identified gap (for example, is it a data incompleteness or inconsistency issue, or just a minor improvement opportunity?)
As a result, the user can easily focus on the most important issues, in an educated fashion.

The two charts on the right provide statistics and trend information regarding identified risks.
The signature: Replication Tree Structure Inconsistency

The impact: In case of disaster, data will be lost. A production database, volume group, disk drive or file system is partially replicated. Data is not recoverable.

Technical details: In this example, the production database spans three storage volumes. The intent is to replicate these production storage volumes to the disaster recovery site; however, one production storage volume is missing an assigned replication device.

Can it happen to me? This is a very common gap found in the environments we have scanned. There are many reasons it could happen, and it typically remains hidden until an actual disaster. The most common reason is that the production storage volume was not added to the device group being replicated.

Relevant Storage Vendors: All
Relevant Replication Methods: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
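The core of a check for this gap can be sketched as a simple set comparison. This is an illustrative Python sketch, not RecoverGuard's implementation; all device IDs are invented:

```python
# Hypothetical sketch: flag production volumes that have no replica
# device assigned in the replication pairing.

def find_unreplicated(production_volumes, replication_pairs):
    """Return production volumes with no replica device assigned.

    production_volumes: iterable of source volume IDs backing the database
    replication_pairs:  dict of source volume ID -> replica volume ID
    """
    return sorted(v for v in production_volumes if v not in replication_pairs)

# The database spans three volumes, but only two were added to the
# replicated device group -- the third is silently unprotected.
prod = ["DEV001", "DEV002", "DEV003"]
pairs = {"DEV001": "DRV101", "DEV002": "DRV102"}
print(find_unreplicated(prod, pairs))  # prints ['DEV003']
```

In practice, the volume list would come from host-side LVM/database inventory and the pairing from the array's replication configuration.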
The signature: Replication Inconsistency – Different RDF groups

The impact: In the event that one RDF group becomes out of sync with the other RDF group, the database at the disaster recovery site would be corrupt and will not be recoverable from the replication technology. Data will need to be restored from a recent backup, increasing the time to recovery.

Technical details: The storage volumes that are used for the database are in two different RDF groups. EMC states that this is not good practice unless the RDF groups are in a consistency group. Each RDF group is associated with different replication adapters and potentially different network infrastructure, which can fail independently of the other RDF group, resulting in corrupted replicas at the disaster recovery site.

Can it happen to me? This is a common gap found in large environments where multiple RDF groups are needed, and it is only revealed during a RecoverGuard scan. It most commonly originates in the provisioning process, when storage volumes from different RDF groups are provisioned to the host and used by the database. The provisioning tools do not alert on, or prevent, provisioning storage from two different RDF groups to the same host.

Relevant Storage Vendors: EMC
Relevant Replication Methods: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All

IMPORTANT NOTE: A similar gap is relevant for ALL storage vendors – when a replicated database or file system spans multiple arrays, replica data consistency is not ensured between arrays. This is a common gap in multi-array environments.
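The rule described above – multiple RDF groups are acceptable only if a consistency group covers all of them – can be sketched as follows (illustrative Python only; group names are invented and the real data would come from SYMCLI output):

```python
def split_across_rdf_groups(volume_rdf_group, consistency_groups=()):
    """Return the RDF groups a database's volumes span, or None if safe.

    volume_rdf_group:   dict of volume ID -> RDF group name
    consistency_groups: iterable of sets of RDF groups protected together
    """
    groups = set(volume_rdf_group.values())
    if len(groups) <= 1:
        return None  # a single RDF group is internally consistent
    for cg in consistency_groups:
        if groups <= set(cg):
            return None  # one consistency group covers all the RDF groups
    return sorted(groups)

# Two volumes in RA-1, one in RA-2, and no consistency group spans both.
print(split_across_rdf_groups(
    {"DEV001": "RA-1", "DEV002": "RA-1", "DEV003": "RA-2"}))
# prints ['RA-1', 'RA-2']
```

The same shape of check covers the multi-array variant noted above: substitute array IDs for RDF group names.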
The signature: Inconsistent access to storage volumes by cluster nodes

The impact: In case of fail-over or switch-over to the passive node, data will not become available. Service groups will fail to go online. The result: DOWNTIME.

Technical details: In this example, a database is running on the cluster's active node and is stored on three storage volumes. Only two of these three volumes are mapped to (accessible by) the cluster's passive node.

Can it happen to me? A VERY common gap. When a new storage volume is needed, it is typically mapped ONLY to the currently active node…

Relevant Cluster Software: All
Relevant Storage Vendors: All
Relevant Replication Methods: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
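Detecting this gap amounts to checking that every cluster node can see every volume the service needs. A minimal sketch (illustrative Python; node and LUN names are made up):

```python
def passive_node_gaps(required_volumes, node_visible_luns):
    """Report, per cluster node, the required volumes it cannot see.

    required_volumes:  volumes the clustered service needs
    node_visible_luns: dict of node name -> set of LUNs mapped to it
    """
    needed = set(required_volumes)
    return {node: sorted(needed - set(luns))
            for node, luns in node_visible_luns.items()
            if needed - set(luns)}

# The active node sees all three volumes; the passive node sees only two,
# so a fail-over to it would leave the service group unable to go online.
print(passive_node_gaps(
    ["LUN1", "LUN2", "LUN3"],
    {"node-a": {"LUN1", "LUN2", "LUN3"}, "node-b": {"LUN1", "LUN2"}}))
# prints {'node-b': ['LUN3']}
```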
The signature: In this example, a copy of production data is accessed by the designated standby, but also, unintentionally, by an unauthorized host.

The impact: During a disaster, a race condition will develop, from which several unpleasant outcomes might arise:
Scenario 1 – the unauthorized host might gain exclusive access to the erroneously mapped disk. In that case, the designated standby cannot mount and use the file system. By the time the problem is isolated and fixed (which could take a long while), there is also the risk of the unauthorized host actually using the erroneously mapped disk, thereby rendering recovery impossible.
Scenario 2 – both the standby and the unauthorized host might get concurrent access to the disk. If the unauthorized host attempts to use the erroneously mapped disk, not only will the data get corrupted instantly, the now-active standby might unexpectedly crash.

Technical details: Scenario 1 will occur if the disk is configured for mutually exclusive access. The first host to attempt access to the disk gains exclusive access, locking the other out. Scenario 2 will occur if the disk is multi-homed, or non-locked. Most file systems on the market were developed under the assumption that external modification of devices is not possible. This stems from the days when only DAS was used, and remains mostly unchanged. Clustered file systems are also vulnerable to the same threat; although they do allow multiple hosts to access the same disk, they all assume that any such host is actually part of the cluster and therefore conforms to predictable behavior. Some operating systems react violently to external tampering with their intrinsic data structures, which could result in a crash.

Can it happen to me? This is a very common gap, found in around 80% of the environments we have scanned. There are dozens of reasons it could happen, and with nearly every one of them it can remain dormant, only to be revealed during an actual disaster.
Here are some examples:
- Some arrays default to mapping all devices to all available ports when installed out-of-the-box. It is the duty of the end-user to “prune” or restrict access by re-defining the mapping on the array and using masking on SAN ports or host HBAs (or all of the above). It is easy to miss some spots. Furthermore, even if masking is applied successfully at a certain time, any maintenance activity on the unauthorized host, including moving it to another SAN port or replacing a failed HBA, might give rise to erroneous mapping.
- The erroneously mapped disk may have actually belonged to the unauthorized host in the past and was then reclaimed, neglecting to remove the mapping definition from the storage array.
- From time to time, extra mappings may be added to increase performance or resiliency of access to the disk. If zoning and masking are not controlled and managed from a central point, one of the paths might go “astray”.
- Sometimes HBAs are replaced not because they are faulty, but because greater capacity is required. If soft zoning is used and not updated accordingly, once such an old HBA is re-used on a different host, it may grant that host access rights to the SAN devices allowed for the original host.
- Many other possibilities exist.

Relevant Storage Vendors: All
Relevant Replication Methods: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
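Whatever the cause, the detection logic is the same: compare the hosts that can actually see each device against the hosts that are supposed to. A hedged Python sketch (not the product's code; host and device names are invented):

```python
def unauthorized_access(device_masking, authorized_hosts):
    """Find hosts that can see a device but should not.

    device_masking:   dict of device -> hosts actually granted access
                      (as discovered from zoning/masking)
    authorized_hosts: dict of device -> hosts intended to have access
    """
    gaps = {}
    for device, hosts in device_masking.items():
        extra = set(hosts) - set(authorized_hosts.get(device, ()))
        if extra:
            gaps[device] = sorted(extra)
    return gaps

# The DR replica should be visible only to the standby, but a stale
# zoning entry also exposes it to an unrelated QA host.
print(unauthorized_access(
    {"DRV101": {"standby-1", "qa-7"}},
    {"DRV101": {"standby-1"}}))
# prints {'DRV101': ['qa-7']}
```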
The signature: Replication age inconsistency

The impact: If inconsistent point-in-time copy devices (such as BCV, Clone or Snap volumes) are needed for recovery, the data contained in the copy is corrupted because the devices are out of sync with each other. In an SRDF replication strategy, point-in-time copies safeguard against rolling disasters – disasters in which data corruption is replicated to the disaster recovery replica as well. The point-in-time copies then become the disk-based recovery.

Technical details: In this example, multiple point-in-time copy groups are associated with a volume group that contains three storage volumes for a production database. One device is in the wrong point-in-time group, so the data contained across the device group would not be usable.

Can it happen to me? Environments relying on rolling or revolving point-in-time copies often have this gap, because the copies are not mounted and regularly used by other processes. The gap is created when one or more devices are referenced in the wrong split and establish scripts.

Relevant Storage Vendors: All
Relevant Replication Methods: All (every method can be used to create point-in-time copies)
Relevant Operating Systems: All
Relevant DBMS Vendors: All
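A consistency check here boils down to verifying that all devices in one point-in-time group carry the same split timestamp. An illustrative Python sketch (device names and timestamps are invented):

```python
from collections import Counter

def pit_outliers(split_times):
    """Return devices whose split timestamp differs from the group majority.

    split_times: dict of device -> split time (epoch seconds)
    """
    if not split_times:
        return []
    majority, _ = Counter(split_times.values()).most_common(1)[0]
    return sorted(d for d, t in split_times.items() if t != majority)

# Two BCVs were split together, but one still holds yesterday's split --
# the combined point-in-time image is internally inconsistent.
print(pit_outliers(
    {"BCV1": 1700000000, "BCV2": 1700000000, "BCV3": 1699913600}))
# prints ['BCV3']
```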
The signature: Configuration drifts between production and its standby DR host

The impact: In the event of a disaster, fail-over to the DR server will not be successful. Manual intervention will be needed to install missing hardware/software, upgrade software and configure kernel parameters correctly. This typically involves extended recovery time and an RTO violation, since identifying the configuration errors commonly takes days (or even weeks).

Technical details: In this example, the corresponding DR server of a production host does not have enough resources to run the application with reasonable performance. A few products are missing on the DR server, while others are of lower versions than what is installed on production. In addition, kernel parameters are configured with significantly lower values than in production. Typically, applications depend on other products installed on the server and on the kernel parameter configuration; for example, it is well known that Oracle is sensitive to the configuration of semaphore-related kernel parameters.

Can it happen to me? This is a very common gap found in DR environments. The configuration of a host involves so many details that it can be very difficult to keep a DR server fully synchronized with its production host at all times. Also, DR tests typically do not involve loading DR with the expected production load, so these configuration issues go undetected.

Relevant Operating Systems: All (Windows, Solaris, HPUX, AIX, Linux)
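Drift detection of this kind reduces to a dictionary diff of collected settings. A minimal sketch in Python (illustrative only; the parameter names and values echo the sample slide but are not tied to any specific collector):

```python
def config_drift(production, dr):
    """List settings that differ between production and its DR standby.

    production, dr: flat dicts of setting name -> value.
    Returns dict of setting -> (production value, DR value);
    a DR value of None means the setting/product is missing there.
    """
    return {key: (value, dr.get(key))
            for key, value in production.items()
            if dr.get(key) != value}

# Fewer semaphores, fewer allowed processes, and a missing product on DR.
drift = config_drift(
    {"semmns": 600, "maxuprc": 8192, "websphere": "6.1"},
    {"semmns": 128, "maxuprc": 1024})
print(drift)
```

The same comparison applies to HA cluster nodes (the following sample); only the data source changes.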
The signature: Configuration drifts between HA cluster nodes

The impact: This will vary depending on the specific drift, but can include a failure to fail-over/switch-over to the other node (causing downtime), or reduced performance after fail-over/switch-over which will, at best, create an operations slowdown and, at worst, leave the node unable to carry the load.

Technical details: In this example, the passive node has no redundancy at the HBA level or in the DNS configuration, while the currently active node is configured with redundancy for both. A single HBA/DNS server configuration is a single point of failure. Upon fail-over/switch-over to the currently passive node, the applications running on this cluster will suffer from reduced availability/MTBF and more downtime. In addition, the passive node is configured with a significantly lower maximum number of open files, which may lead to application failures. Moreover, the passive node has only 1 GB of swap while the active node was configured with an additional 4 GB; upon fail-over, the applications may not have sufficient memory to run properly. Lastly, differences in installed products may have various impacts, depending on the product type.

Can it happen to me? This situation occurs frequently in HA environments. The configuration of a host involves so many details that it is very difficult to ensure an HA node is fully synchronized with its peer at all times.

Relevant Operating Systems: All (Windows, Solaris, HPUX, AIX, Linux)
The signature: Production data accessed with no redundant path

The impact: A single array port mapping and a single I/O path increase the chances that this storage volume may become unavailable. This may result in reduced MTBF and frequent downtime. Also, any application which uses this storage volume may suffer from sub-optimal performance, since I/O load balancing is unavailable (single path from host to the storage array).

Technical details: In production environments it is typically considered a best practice to:
- Configure multiple LUN maps (array port mappings) for a storage volume
- Configure multiple paths for a storage volume
In the example above, a database is stored on three storage volumes. Two of these volumes are configured according to these best practices. However, a third volume, which was recently added, does not comply and has only a single array port mapping and a single I/O path.

Can it happen to me? Yes. In production environments urgent requests are not infrequent, such as the need to add more storage space to a specific business service. While handling such urgent matters, details such as redundancy in array port mappings and SAN I/O paths may be forgotten. After the change everything works properly, so the error goes unnoticed – until the single path fails.

Relevant Storage Vendors: All
Relevant HBA Vendors: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
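The redundancy rule can be expressed as a simple threshold over the discovered path inventory. A hedged Python sketch (HBA and port names are invented for illustration):

```python
def under_redundant(volume_paths, minimum=2):
    """Flag volumes with fewer I/O paths than the required minimum.

    volume_paths: dict of volume -> list of (HBA, array port) path tuples
    minimum:      required number of independent paths (best practice: >= 2)
    """
    return sorted(v for v, paths in volume_paths.items()
                  if len(paths) < minimum)

# Two volumes have dual paths; the recently added third has only one.
print(under_redundant({
    "LUN1": [("hba0", "fa-1a"), ("hba1", "fa-2a")],
    "LUN2": [("hba0", "fa-1a"), ("hba1", "fa-2a")],
    "LUN3": [("hba0", "fa-1a")],
}))
# prints ['LUN3']
```

A fuller check would also verify that the paths traverse distinct HBAs and array ports, not merely that two path entries exist.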
The signature: In this example, a critical file system is stored on three SAN volumes. The data is periodically synchronized, but the copies are not of the exact same age.

The impact: In such a scenario the copy is likely to be corrupt and unusable. If the file system is busy or serves large files (database files usually meet both criteria), it is extremely likely to be corrupt.

Technical details: File systems have certain built-in self-correction mechanisms, targeted at overcoming slight differences resulting from pending writes unsuccessfully flushed from memory to disk after an abrupt shutdown (such as a power failure or “blue screen”). These mechanisms are not designed to handle disks which appear to “go back in time” by minutes or hours. Replicating disks at various points in time can easily lead to such scenarios, which seem completely “unnatural” to the operating system at the DR site. Journaled file systems will not help, because they either (a) journal only file system metadata, not the data itself, or (b) keep journal data spread on the disks themselves, which is prone to the same time-difference corruption.

Can it happen to me? This is one of the top-5 gaps, found in even the most well-kept environments. There are dozens of reasons it could happen, and with nearly every one of them it is nearly impossible to tell that the problem has happened. Because the replication itself is successful, there is no indication to the user that something is wrong. Some examples:
- All the disk synchs are correctly managed by one script, but another script, perhaps on a different host, runs afterwards and has a stray mapping to one of the source disks.
- All the disks are added to one array consistency group (or device group) which is used to synch them simultaneously.
Note that the definition of the array consistency group is completely separate from the definition of the file system and the underlying logical volume and volume group. It is easy to associate a disk newly added to the volume group on the host side with the wrong array consistency group. There are dozens of permutations and variations on the same theme.
- One of the disks is copied over a separate cross-array link than the others. This link might be much busier and cause the synch (or mirror, or split, etc. – depending on the vendor terminology) to take more time.

Relevant Storage Vendors: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
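Whatever the root cause, the symptom is measurable: the last-sync timestamps of the replica disks are spread further apart than the replication schedule allows. An illustrative Python sketch (device names and times are invented):

```python
def replica_age_spread(sync_times, tolerance=0):
    """Check whether replica copies were taken at (nearly) the same time.

    sync_times: dict of replica device -> last-sync time (epoch seconds)
    tolerance:  allowed spread in seconds between oldest and newest copy
    Returns the spread in seconds if it exceeds the tolerance, else None.
    """
    times = list(sync_times.values())
    spread = max(times) - min(times)
    return spread if spread > tolerance else None

# Two disks were synchronized together; the third finished three hours
# earlier over a busier cross-array link, so the combined file-system
# image "goes back in time" on one of its disks.
print(replica_age_spread(
    {"R1": 1700000000, "R2": 1700000000, "R3": 1699989200}))
# prints 10800
```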
The signature: Mixed Storage Types

The impact: In the event that the disaster recovery replica is needed, it will be unusable, resulting in data loss. The production database or file system replication is incomplete or inconsistent and will not be recoverable from the replication technology. Data will need to be restored from a recent backup at the disaster recovery site, increasing the time to recovery.

Technical details: In this example, the production database spans three storage volumes. The intent is to replicate these production storage volumes to the disaster recovery site; however, one production storage volume is not of the same storage type – it is actually a local disk and is therefore not being replicated. The result is an incomplete replica at the disaster recovery site.

Can it happen to me? This is a common gap found in rapidly evolving environments with many teams involved in the provisioning process. The handoffs between the storage team, platform team and database teams are complex, and many times mixed storage devices (local, EMC, NetApp, etc.) are used to create the volume groups (Veritas or other LVM software) in which databases are created or extended.

Relevant Storage Vendors: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
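The check is a cardinality test on the storage types backing one volume group. A minimal Python sketch (illustrative only; type labels are invented):

```python
def mixed_storage_types(volume_types):
    """Return the distinct storage types a volume group mixes, or None.

    volume_types: dict of volume -> storage type
                  (e.g. an array family, or 'local' for an internal disk)
    """
    types = set(volume_types.values())
    return sorted(types) if len(types) > 1 else None

# Two volumes live on a replicated array; the third is a local disk,
# so the DR copy of this volume group can never be complete.
print(mixed_storage_types(
    {"vol1": "EMC-Symmetrix", "vol2": "EMC-Symmetrix", "vol3": "local"}))
# prints ['EMC-Symmetrix', 'local']
```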
The signature: Mixed RAID types

The impact: The impact of mixing RAID types is far less critical than mixing storage types that require replication. It involves potential performance issues and less than optimal storage utilization.

Technical details: In this example, the production file system contains three storage volumes. Two are on RAID1-protected storage and one is on RAID5-protected storage, all replicated to the disaster recovery site. In some cases the production volumes are of the same RAID type but the disaster replica mixes RAID types, and would potentially perform much differently from production.

Can it happen to me? This is a common gap when multiple RAID types are provisioned to the same host for databases, where RAID1 is used for logs and indexes and RAID5 for table spaces, or when different tiers of storage, defined by RAID type, are offered to the business.

Relevant Storage Vendors: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
The signature: A file system defined within a cluster mount resource is mounted automatically upon boot.

The impact: Potential data corruption after fail-over, switch-over or node restart.

Technical details: In this example, the passive node is configured to automatically mount “/d01” on boot. If the passive node is restarted, it will attempt to mount a file system which is already mounted on the currently active node. In this case, data might become corrupted, since typically a SAN LUN should only be accessed by a single server at a time. Note that the opposite scenario is problematic as well: if the file system is configured to be mounted automatically on boot on the active node, the same risk will exist after a fail-over or switch-over.

Can it happen to me? This is a very common gap in HA environments, because it is difficult to constantly keep the server configuration in sync with the cluster configuration. The resulting configuration mismatches, such as the one described above, lead to data protection and availability vulnerabilities.

Relevant Storage Vendors: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
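The conflict described above can be detected by intersecting the node's boot-time mount list with the mount points owned by cluster resources. A hedged Python sketch (mount points are illustrative):

```python
def boot_mount_conflicts(fstab_auto_mounts, cluster_mounts):
    """File systems both auto-mounted at boot and managed by the cluster.

    fstab_auto_mounts: mount points the node mounts automatically on boot
    cluster_mounts:    mount points owned by cluster mount resources
    """
    return sorted(set(fstab_auto_mounts) & set(cluster_mounts))

# "/d01" is a cluster mount resource, yet the passive node's boot
# configuration also mounts it -- a restart of that node would mount a
# file system already in use on the active node.
print(boot_mount_conflicts(["/", "/var", "/d01"], ["/d01", "/d02"]))
# prints ['/d01']
```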
Notes:
- Basic support and gap detection for other clusters as well (HP ServiceGuard, Sun Cluster, Linux Cluster, Microsoft Cluster, RAC).
- Limited support for VMware FC; full support planned for 2009.
- Support for IBM DS is planned for 2009.
- Support for EMC SAN Copy replication is planned for 2009.
- EMC Celerra is not supported.