16. Sample #5: Local Replication with BCVs – Replication Age Inconsistency
Result: Data corruption
17. Sample #6: Configuration drifts between Production and DR
Result: Increased time to recover
Production – Hardware: 8 x CPU 2.2GHz, 32 GB RAM, 2 x HBA, 2 x NIC. Software: OS HP-UX 11.31, WebSphere, Java 1.5, EMC PowerPath 4.4. Kernel parameters: max user processes 8192, max # of semaphores 600.
DR – Hardware: 2 x CPU 2.2GHz, 8 GB RAM, 1 x HBA, 1 x NIC. Software: OS HP-UX 11.23, NO WebSphere, Java 1.4.2, EMC PowerPath 3.0.5. Kernel parameters: max user processes 1024, max # of semaphores 128.
More differences in the areas of DNS, NTP, page files, Internet services, patches, etc.
18. Sample #7: Configuration drifts between Production and HA
Result: Downtime, manual intervention needed to recover
Hardware: 2 x HBA. Software: Microsoft .NET 2.0 SP2, Windows x64 SP1, Oracle MTS Recovery Service. DNS configuration: 192.168.68.50, 192.168.68.51, 192.168.2.50. Page files: 1 x 1 GB (C:), 1 x 4 GB (D:). Kernel parameters: number of open files 32767.
19. Sample #8: SAN I/O path – single point of failure
Result: Reduced MTBF, downtime, sub-optimal performance
20. Sample #9: Replica create time inconsistency
Result: File system not usable at the DR site
24. How it works
- Windows 2003 Server with an Oracle 10g schema
- SYMCLI/NaviCLI “proxy” for EMC Symmetrix / CLARiiON storage arrays
- StorageScope API for EMC ECC
- HiCommand API for HDS
- SSH / Telnet for NetApp filers
- SSH / WMI using valid user credentials (hosts)
- JDBC using valid user credentials (databases, including DB2)
- IE6+ web client, Java 1.5+
Continuity offers a product that can find all those problems as they happen – instead of waiting a full year (which means no protection during that year).
The RecoverGuard dashboard provides concise and valuable information regarding your DR coverage and status – at a glance.

The top-left pane summarizes the last scan's coverage, identifying the hosts, databases, storage arrays and business services (or processes) scanned. It also points out which areas could not be reached, letting the user decide on the appropriate action. Clicking the pane reveals a detailed scan report, including scan history and statistics.

The middle-left pane provides a “snapshot” of the current business service protection state – identifying risks to data and system availability, as well as optimization opportunities. Clicking on any business service reveals a more detailed information view and allows easy navigation into specific gap tickets.

The bottom tabbed view displays the top 5 currently-open tickets, as well as the top 5 recently detected ones. Clicking each ticket opens the appropriate ticket details view (see examples in the next slides). Notice that each ticket is ranked by its threat level. The threat level computation weighs, among other considerations:
- The importance of the involved business service
- The role of the resource identified by the ticket (for example, is it production data? A replica used for DR? A replica used for QA? The risk is obviously different in each case)
- The technical severity of the identified gap (for example, is it a data incompleteness or inconsistency issue, or just a minor improvement opportunity?)
As a result, the user can easily focus on the most important issues, in an educated fashion.

The two charts on the right provide statistics and trend information regarding identified risks.
The signature: Replication Tree Structure Inconsistency

The impact: In case of disaster, data will be lost. A production database, volume group, disk drive or file system is partially replicated. Data is not recoverable.

Technical details: In this example, the production database spans three storage volumes. The intent is to replicate these production storage volumes to the disaster recovery site; however, one production storage volume is missing an assigned replication device.

Can it happen to me? This is a very common gap found in the environments we have scanned. There are many reasons it could happen, and it typically remains hidden until an actual disaster. The most common reason is that the production storage volume was not added to the device group being replicated.

Relevant Storage Vendors: All
Relevant Replication Methods: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
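The core of a check for this gap can be sketched as a simple set comparison. This is an illustrative Python sketch, not RecoverGuard's implementation; all device IDs are invented:

```python
# Hypothetical sketch: flag production volumes that have no replica
# device assigned in the replication pairing.

def find_unreplicated(production_volumes, replication_pairs):
    """Return production volumes with no replica device assigned.

    production_volumes: iterable of source volume IDs backing the database
    replication_pairs:  dict of source volume ID -> replica volume ID
    """
    return sorted(v for v in production_volumes if v not in replication_pairs)

# The database spans three volumes, but only two were added to the
# replicated device group -- the third is silently unprotected.
prod = ["DEV001", "DEV002", "DEV003"]
pairs = {"DEV001": "DRV101", "DEV002": "DRV102"}
print(find_unreplicated(prod, pairs))  # prints ['DEV003']
```

In practice, the volume list would come from host-side LVM/database inventory and the pairing from the array's replication configuration.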
The signature: Replication Inconsistency – Different RDF groups

The impact: In the event that one RDF group becomes out of sync with the other RDF group, the database at the disaster recovery site would be corrupt and will not be recoverable from the replication technology. Data will need to be restored from a recent backup, increasing the time to recovery.

Technical details: The storage volumes that are used for the database are in two different RDF groups. EMC states that this is not good practice unless the RDF groups are in a consistency group. Each RDF group is associated with different replication adapters and potentially different network infrastructure, which can fail independently of the other RDF group, resulting in corrupted replicas at the disaster recovery site.

Can it happen to me? This is a common gap found in large environments where multiple RDF groups are needed, and it is only revealed during a RecoverGuard scan. It most commonly originates in the provisioning process, when storage volumes from different RDF groups are provisioned to the host and used by the database. The provisioning tools do not alert on, or prevent, provisioning storage from two different RDF groups to the same host.

Relevant Storage Vendors: EMC
Relevant Replication Methods: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All

IMPORTANT NOTE: A similar gap is relevant for ALL storage vendors – when a replicated database or file system spans multiple arrays, replica data consistency is not ensured between arrays. This is a common gap in multi-array environments.
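The rule described above – multiple RDF groups are acceptable only if a consistency group covers all of them – can be sketched as follows (illustrative Python only; group names are invented and the real data would come from SYMCLI output):

```python
def split_across_rdf_groups(volume_rdf_group, consistency_groups=()):
    """Return the RDF groups a database's volumes span, or None if safe.

    volume_rdf_group:   dict of volume ID -> RDF group name
    consistency_groups: iterable of sets of RDF groups protected together
    """
    groups = set(volume_rdf_group.values())
    if len(groups) <= 1:
        return None  # a single RDF group is internally consistent
    for cg in consistency_groups:
        if groups <= set(cg):
            return None  # one consistency group covers all the RDF groups
    return sorted(groups)

# Two volumes in RA-1, one in RA-2, and no consistency group spans both.
print(split_across_rdf_groups(
    {"DEV001": "RA-1", "DEV002": "RA-1", "DEV003": "RA-2"}))
# prints ['RA-1', 'RA-2']
```

The same shape of check covers the multi-array variant noted above: substitute array IDs for RDF group names.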
The signature: Inconsistent access to storage volumes by cluster nodes

The impact: In case of fail-over or switch-over to the passive node, data will not become available. Service groups will fail to go online. The result: DOWNTIME.

Technical details: In this example, a database is running on the cluster's active node and is stored on three storage volumes. Only two of these three volumes are mapped to (accessible by) the cluster's passive node.

Can it happen to me? A VERY common gap. When a new storage volume is needed, it is typically mapped ONLY to the currently active node…

Relevant Cluster Software: All
Relevant Storage Vendors: All
Relevant Replication Methods: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
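Detecting this gap amounts to checking that every cluster node can see every volume the service needs. A minimal sketch (illustrative Python; node and LUN names are made up):

```python
def passive_node_gaps(required_volumes, node_visible_luns):
    """Report, per cluster node, the required volumes it cannot see.

    required_volumes:  volumes the clustered service needs
    node_visible_luns: dict of node name -> set of LUNs mapped to it
    """
    needed = set(required_volumes)
    return {node: sorted(needed - set(luns))
            for node, luns in node_visible_luns.items()
            if needed - set(luns)}

# The active node sees all three volumes; the passive node sees only two,
# so a fail-over to it would leave the service group unable to go online.
print(passive_node_gaps(
    ["LUN1", "LUN2", "LUN3"],
    {"node-a": {"LUN1", "LUN2", "LUN3"}, "node-b": {"LUN1", "LUN2"}}))
# prints {'node-b': ['LUN3']}
```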
The signature: In this example, a copy of production data is accessed by the designated standby, but also, unintentionally, by an unauthorized host.

The impact: During a disaster, a race condition will develop, from which several unpleasant outcomes might arise:
Scenario 1 – the unauthorized host might gain exclusive access to the erroneously mapped disk. In that case, the designated standby cannot mount and use the file system. By the time the problem is isolated and fixed (which could take a long while), there is also the risk of the unauthorized host actually using the erroneously mapped disk, thereby rendering recovery impossible.
Scenario 2 – both the standby and the unauthorized host might get concurrent access to the disk. If the unauthorized host attempts to use the erroneously mapped disk, not only will the data get corrupted instantly, the now-active standby might unexpectedly crash.

Technical details: Scenario 1 will occur if the disk is configured for mutually exclusive access. The first host to attempt access to the disk gains exclusive access, locking the other out. Scenario 2 will occur if the disk is multi-homed, or non-locked. Most file systems on the market were developed under the assumption that external modification of devices is not possible. This stems from the days when only DAS was used, and remains mostly unchanged. Clustered file systems are also vulnerable to the same threat; although they do allow multiple hosts to access the same disk, they all assume that any such host is actually part of the cluster and therefore conforms to predictable behavior. Some operating systems react violently to external tampering with their intrinsic data structures, which could result in a crash.

Can it happen to me? This is a very common gap, found in around 80% of the environments we have scanned. There are dozens of reasons it could happen, and with nearly every one of them it can remain dormant, only to be revealed during an actual disaster.
Here are some examples:
- Some arrays default to mapping all devices to all available ports when installed out-of-the-box. It is the duty of the end-user to “prune” or restrict access by re-defining the mapping on the array and using masking on SAN ports or host HBAs (or all of the above). It is easy to miss some spots. Furthermore, even if masking is applied successfully at a certain time, any maintenance activity on the unauthorized host, including moving it to another SAN port or replacing a failed HBA, might give rise to erroneous mapping.
- The erroneously mapped disk may have actually belonged to the unauthorized host in the past and was then reclaimed, neglecting to remove the mapping definition from the storage array.
- From time to time, extra mappings may be added to increase performance or resiliency of access to the disk. If zoning and masking are not controlled and managed from a central point, one of the paths might go “astray”.
- Sometimes HBAs are replaced not because they are faulty, but because greater capacity is required. If soft zoning is used and not updated accordingly, once such an old HBA is re-used on a different host, it may grant that host access rights to the SAN devices allowed for the original host.
- Many other possibilities exist.

Relevant Storage Vendors: All
Relevant Replication Methods: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
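Whatever the cause, the detection logic is the same: compare the hosts that can actually see each device against the hosts that are supposed to. A hedged Python sketch (not the product's code; host and device names are invented):

```python
def unauthorized_access(device_masking, authorized_hosts):
    """Find hosts that can see a device but should not.

    device_masking:   dict of device -> hosts actually granted access
                      (as discovered from zoning/masking)
    authorized_hosts: dict of device -> hosts intended to have access
    """
    gaps = {}
    for device, hosts in device_masking.items():
        extra = set(hosts) - set(authorized_hosts.get(device, ()))
        if extra:
            gaps[device] = sorted(extra)
    return gaps

# The DR replica should be visible only to the standby, but a stale
# zoning entry also exposes it to an unrelated QA host.
print(unauthorized_access(
    {"DRV101": {"standby-1", "qa-7"}},
    {"DRV101": {"standby-1"}}))
# prints {'DRV101': ['qa-7']}
```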
The signature: Replication age inconsistency

The impact: If inconsistent point-in-time copy devices (such as BCV, Clone or Snap volumes) are needed for recovery, the data contained in the copy is corrupted because the devices are out of sync with each other. In an SRDF replication strategy, point-in-time copies safeguard against rolling disasters – disasters in which data corruption is replicated to the disaster recovery replica as well. The point-in-time copies then become the disk-based recovery.

Technical details: In this example, multiple point-in-time copy groups are associated with a volume group that contains three storage volumes for a production database. One device is in the wrong point-in-time group, so the data contained across the device group would not be usable.

Can it happen to me? Environments relying on rolling or revolving point-in-time copies often have this gap, because the copies are not mounted and regularly used by other processes. The gap is created when one or more devices are referenced in the wrong split and establish scripts.

Relevant Storage Vendors: All
Relevant Replication Methods: All (every method can be used to create point-in-time copies)
Relevant Operating Systems: All
Relevant DBMS Vendors: All
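A consistency check here boils down to verifying that all devices in one point-in-time group carry the same split timestamp. An illustrative Python sketch (device names and timestamps are invented):

```python
from collections import Counter

def pit_outliers(split_times):
    """Return devices whose split timestamp differs from the group majority.

    split_times: dict of device -> split time (epoch seconds)
    """
    if not split_times:
        return []
    majority, _ = Counter(split_times.values()).most_common(1)[0]
    return sorted(d for d, t in split_times.items() if t != majority)

# Two BCVs were split together, but one still holds yesterday's split --
# the combined point-in-time image is internally inconsistent.
print(pit_outliers(
    {"BCV1": 1700000000, "BCV2": 1700000000, "BCV3": 1699913600}))
# prints ['BCV3']
```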
The signature: Configuration drifts between production and its standby DR host

The impact: In the event of a disaster, fail-over to the DR server will not be successful. Manual intervention will be needed to install missing hardware/software, upgrade software and configure kernel parameters correctly. This typically involves extended recovery time and an RTO violation, since identifying the configuration errors commonly takes days (or even weeks).

Technical details: In this example, the corresponding DR server of a production host does not have enough resources to run the application with reasonable performance. A few products are missing on the DR server, while others are of lower versions than what is installed on production. In addition, kernel parameters are configured with significantly lower values than in production. Typically, applications depend on other products installed on the server and on the kernel parameter configuration; for example, it is well known that Oracle is sensitive to the configuration of semaphore-related kernel parameters.

Can it happen to me? This is a very common gap found in DR environments. The configuration of a host involves so many details that it can be very difficult to keep a DR server fully synchronized with its production host at all times. Also, DR tests typically do not involve loading DR with the expected production load, so these configuration issues go undetected.

Relevant Operating Systems: All (Windows, Solaris, HPUX, AIX, Linux)
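Drift detection of this kind reduces to a dictionary diff of collected settings. A minimal sketch in Python (illustrative only; the parameter names and values echo the sample slide but are not tied to any specific collector):

```python
def config_drift(production, dr):
    """List settings that differ between production and its DR standby.

    production, dr: flat dicts of setting name -> value.
    Returns dict of setting -> (production value, DR value);
    a DR value of None means the setting/product is missing there.
    """
    return {key: (value, dr.get(key))
            for key, value in production.items()
            if dr.get(key) != value}

# Fewer semaphores, fewer allowed processes, and a missing product on DR.
drift = config_drift(
    {"semmns": 600, "maxuprc": 8192, "websphere": "6.1"},
    {"semmns": 128, "maxuprc": 1024})
print(drift)
```

The same comparison applies to HA cluster nodes (the following sample); only the data source changes.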
The signature: Configuration drifts between HA cluster nodes

The impact: This will vary depending on the specific drift, but can include a failure to fail-over/switch-over to the other node (causing downtime), or reduced performance after fail-over/switch-over which will, at best, create an operations slowdown and, at worst, leave the node unable to carry the load.

Technical details: In this example, the passive node has no redundancy at the HBA level or in the DNS configuration, while the currently active node is configured with redundancy for both. A single HBA/DNS server configuration is a single point of failure. Upon fail-over/switch-over to the currently passive node, the applications running on this cluster will suffer from reduced availability/MTBF and more downtime. In addition, the passive node is configured with a significantly lower maximum number of open files, which may lead to application failures. Moreover, the passive node has only 1 GB of swap while the active node was configured with an additional 4 GB; upon fail-over, the applications may not have sufficient memory to run properly. Lastly, differences in installed products may have various impacts, depending on the product type.

Can it happen to me? This situation occurs frequently in HA environments. The configuration of a host involves so many details that it is very difficult to ensure an HA node is fully synchronized with its peer at all times.

Relevant Operating Systems: All (Windows, Solaris, HPUX, AIX, Linux)
The signature: Production data accessed with no redundant path

The impact: A single array port mapping and a single I/O path increase the chances that this storage volume may become unavailable. This may result in reduced MTBF and frequent downtime. Also, any application which uses this storage volume may suffer from sub-optimal performance, since I/O load balancing is unavailable (single path from host to the storage array).

Technical details: In production environments it is typically considered a best practice to:
- Configure multiple LUN maps (array port mappings) for a storage volume
- Configure multiple paths for a storage volume
In the example above, a database is stored on three storage volumes. Two of these volumes are configured according to these best practices. However, a third volume, which was recently added, does not comply and has only a single array port mapping and a single I/O path.

Can it happen to me? Yes. In production environments urgent requests are not infrequent, such as the need to add more storage space to a specific business service. While handling such urgent matters, details such as redundancy in array port mappings and SAN I/O paths may be forgotten. After the change everything works properly, so the error goes unnoticed – until the single path fails.

Relevant Storage Vendors: All
Relevant HBA Vendors: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
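The redundancy rule can be expressed as a simple threshold over the discovered path inventory. A hedged Python sketch (HBA and port names are invented for illustration):

```python
def under_redundant(volume_paths, minimum=2):
    """Flag volumes with fewer I/O paths than the required minimum.

    volume_paths: dict of volume -> list of (HBA, array port) path tuples
    minimum:      required number of independent paths (best practice: >= 2)
    """
    return sorted(v for v, paths in volume_paths.items()
                  if len(paths) < minimum)

# Two volumes have dual paths; the recently added third has only one.
print(under_redundant({
    "LUN1": [("hba0", "fa-1a"), ("hba1", "fa-2a")],
    "LUN2": [("hba0", "fa-1a"), ("hba1", "fa-2a")],
    "LUN3": [("hba0", "fa-1a")],
}))
# prints ['LUN3']
```

A fuller check would also verify that the paths traverse distinct HBAs and array ports, not merely that two path entries exist.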
The signature: In this example, a critical file system is stored on three SAN volumes. The data is periodically synchronized, but the copies are not of the exact same age.

The impact: In such a scenario the copy is likely to be corrupt and unusable. If the file system is busy or serves large files (database files usually meet both criteria), it is extremely likely to be corrupt.

Technical details: File systems have certain built-in self-correction mechanisms, targeted at overcoming slight differences resulting from pending writes unsuccessfully flushed from memory to disk after an abrupt shutdown (such as a power failure or “blue screen”). These mechanisms are not designed to handle disks which appear to “go back in time” by minutes or hours. Replicating disks at various points in time can easily lead to such scenarios, which seem completely “unnatural” to the operating system at the DR site. Journaled file systems will not help, because they either (a) journal only file system metadata, not the data itself, or (b) keep journal data spread on the disks themselves, which is prone to the same time-difference corruption.

Can it happen to me? This is one of the top-5 gaps, found in even the most well-kept environments. There are dozens of reasons it could happen, and with nearly every one of them it is nearly impossible to tell that the problem has happened. Because the replication itself is successful, there is no indication to the user that something is wrong. Some examples:
- All the disk synchs are correctly managed by one script, but another script, perhaps on a different host, runs afterwards and has a stray mapping to one of the source disks.
- All the disks are added to one array consistency group (or device group) which is used to synch them simultaneously.
Note that the definition of the array consistency group is completely separate from the definition of the file system and the underlying logical volume and volume group. It is easy to associate a disk newly added to the volume group on the host side with the wrong array consistency group. There are dozens of permutations and variations on the same theme.
- One of the disks is copied over a separate cross-array link than the others. This link might be much busier and cause the synch (or mirror, or split, etc. – depending on the vendor terminology) to take more time.

Relevant Storage Vendors: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
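Whatever the root cause, the symptom is measurable: the last-sync timestamps of the replica disks are spread further apart than the replication schedule allows. An illustrative Python sketch (device names and times are invented):

```python
def replica_age_spread(sync_times, tolerance=0):
    """Check whether replica copies were taken at (nearly) the same time.

    sync_times: dict of replica device -> last-sync time (epoch seconds)
    tolerance:  allowed spread in seconds between oldest and newest copy
    Returns the spread in seconds if it exceeds the tolerance, else None.
    """
    times = list(sync_times.values())
    spread = max(times) - min(times)
    return spread if spread > tolerance else None

# Two disks were synchronized together; the third finished three hours
# earlier over a busier cross-array link, so the combined file-system
# image "goes back in time" on one of its disks.
print(replica_age_spread(
    {"R1": 1700000000, "R2": 1700000000, "R3": 1699989200}))
# prints 10800
```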
The signature: Mixed Storage Types

The impact: In the event that the disaster recovery replica is needed, it will be unusable, resulting in data loss. The production database or file system replication is incomplete or inconsistent and will not be recoverable from the replication technology. Data will need to be restored from a recent backup at the disaster recovery site, increasing the time to recovery.

Technical details: In this example, the production database spans three storage volumes. The intent is to replicate these production storage volumes to the disaster recovery site; however, one production storage volume is not of the same storage type – it is actually a local disk and is therefore not being replicated. The result is an incomplete replica at the disaster recovery site.

Can it happen to me? This is a common gap found in rapidly evolving environments with many teams involved in the provisioning process. The handoffs between the storage team, platform team and database teams are complex, and many times mixed storage devices (local, EMC, NetApp, etc.) are used to create the volume groups (Veritas or other LVM software) in which databases are created or extended.

Relevant Storage Vendors: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
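The check is a cardinality test on the storage types backing one volume group. A minimal Python sketch (illustrative only; type labels are invented):

```python
def mixed_storage_types(volume_types):
    """Return the distinct storage types a volume group mixes, or None.

    volume_types: dict of volume -> storage type
                  (e.g. an array family, or 'local' for an internal disk)
    """
    types = set(volume_types.values())
    return sorted(types) if len(types) > 1 else None

# Two volumes live on a replicated array; the third is a local disk,
# so the DR copy of this volume group can never be complete.
print(mixed_storage_types(
    {"vol1": "EMC-Symmetrix", "vol2": "EMC-Symmetrix", "vol3": "local"}))
# prints ['EMC-Symmetrix', 'local']
```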
The signature: Mixed RAID types

The impact: The impact of mixing RAID types is far less critical than mixing storage types that require replication. It involves potential performance issues and less than optimal storage utilization.

Technical details: In this example, the production file system contains three storage volumes. Two are on RAID1-protected storage and one is on RAID5-protected storage, all replicated to the disaster recovery site. In some cases the production volumes are of the same RAID type but the disaster replica mixes RAID types, and would potentially perform much differently from production.

Can it happen to me? This is a common gap when multiple RAID types are provisioned to the same host for databases, where RAID1 is used for logs and indexes and RAID5 for table spaces, or when different tiers of storage, defined by RAID type, are offered to the business.

Relevant Storage Vendors: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
The signature: A file system defined within a cluster mount resource is mounted automatically upon boot.

The impact: Potential data corruption after fail-over, switch-over or node restart.

Technical details: In this example, the passive node is configured to automatically mount “/d01” on boot. If the passive node is restarted, it will attempt to mount a file system which is already mounted on the currently active node. In this case, data might become corrupted, since typically a SAN LUN should only be accessed by a single server at a time. Note that the opposite scenario is problematic as well: if the file system is configured to be mounted automatically on boot on the active node, the same risk will exist after a fail-over or switch-over.

Can it happen to me? This is a very common gap in HA environments, because it is difficult to constantly keep the server configuration in sync with the cluster configuration. The resulting configuration mismatches, such as the one described above, lead to data protection and availability vulnerabilities.

Relevant Storage Vendors: All
Relevant Operating Systems: All
Relevant DBMS Vendors: All
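The conflict described above can be detected by intersecting the node's boot-time mount list with the mount points owned by cluster resources. A hedged Python sketch (mount points are illustrative):

```python
def boot_mount_conflicts(fstab_auto_mounts, cluster_mounts):
    """File systems both auto-mounted at boot and managed by the cluster.

    fstab_auto_mounts: mount points the node mounts automatically on boot
    cluster_mounts:    mount points owned by cluster mount resources
    """
    return sorted(set(fstab_auto_mounts) & set(cluster_mounts))

# "/d01" is a cluster mount resource, yet the passive node's boot
# configuration also mounts it -- a restart of that node would mount a
# file system already in use on the active node.
print(boot_mount_conflicts(["/", "/var", "/d01"], ["/d01", "/d02"]))
# prints ['/d01']
```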
Notes:
- Basic support and gap detection for other clusters as well (HP ServiceGuard, Sun Cluster, Linux Cluster, Microsoft Cluster, RAC).
- Limited support for VMware FC; full support planned for 2009.
- Support for IBM DS is planned for 2009.
- Support for EMC SAN Copy replication is planned for 2009.
- EMC Celerra is not supported.