Sorting Through the Confusion
Replacing Tape with Disk for Backups
WHITE PAPER
Table of Contents
Introduction .................................................................................................................... 2
Considerations When Examining Disk-Based Backup Approaches........................................... 4
Backup Requirements ....................................................................................................... 7
Backup or Cloud Services .................................................................................................. 8
Disk Staging .................................................................................................................. 10
Primary Storage SNAPS .................................................................................................. 12
Backup Application Deduplication in the Media Server ......................................................... 13
Backup Application Client Side Deduplication ..................................................................... 15
Purpose-Built Target Side Deduplication Appliances ............................................................ 17
Summary ...................................................................................................................... 18
About ExaGrid ............................................................................................................... 19
Sorting Through the Confusion Page 1|
Introduction
The reason a 50-year-old technology like tape is still around is simple: it is cheap. But there is
increasing pressure on businesses to fix their backups, as detailed in many sources including the
report, “Best Practices for Addressing the Broken State of Backup” by Dave Russell, research
vice president at Gartner. He found that “for many organizations, backup has become an
increasingly daunting and brittle task fraught with significant challenges.”
The pressure of data growth has increased sharply as businesses need to store both onsite and
offsite copies of their data. This can mean storing 40 to 100 times the volume of their primary
dataset, due to storing weeks of retention onsite and weeks, months and in some cases, years of
retention off site.
Onsite copies are kept in order to recover a deleted or overwritten file or to recover from a
system outage, hardware failure or data corruption that goes unnoticed until the data is needed
again, perhaps weeks later. Offsite copies are kept in order to recover data if the primary site
has a disaster.
Maintaining more copies with longer-term retention is driven by business needs such as SEC
audits; regulations such as the Gramm-Leach-Bliley Act (GLBA), the Health Insurance Portability
and Accountability Act (HIPAA) and Sarbanes-Oxley (SOX); legal requirements such as the need
for legal discovery; Service Level Agreements (SLAs); contractual obligations; and many other
business and legal reasons. Labeling tapes, transporting them, storing them and ultimately
finding the right tape when requested is a challenge in itself. This is compounded by the fact
that the data may not even be on the tapes, since 30% of tapes are found corrupted, damaged
or blank.
Tape has some intrinsic problems:
- The number of simultaneous jobs that can write to tape is determined by the number of drives in the tape library, resulting in unnecessarily long backup times.
- Restores fail about 30% of the time from tapes, which can be missing files, have corrupted files or be unreadable or blank.
- Tapes are physically transported repeatedly and can be lost, misplaced, or stolen in transit.
- Tape labels are handwritten, making them subject to human error. They can also fall off or be unreadable.
- Tapes can be damaged by wear through overuse in the rotation, heat, humidity, magnetic fields, dirt and other environmental conditions.
- Data at rest on tapes is not encrypted. Tape encryption dramatically increases the backup window. Tapes can be password-protected but require a system to track passwords, which is subject to human error.
- It takes time to restore from tapes, and even more time if, for any reason, a tape set is bad. In that case, you have to fall back to an earlier tape set and start the restore again.
- The time to first find, then retrieve, tapes in transit or at remote locations must be factored in as well.
Inertia and confusion have kept tape alive
The market is full of affordable alternatives to tape, but at the same time there is a lot of
confusion about technologies like deduplication. Disk has been making steady inroads on tape as
the primary target for backup software, for reasons that go deeper than being cheaper, faster
and easier to manage than tape. Because of the ambiguity surrounding tape alternatives, some
businesses are confused when looking to replace tape and tape solutions.
Disk is just one part of a whole new equation that has emerged, where near real-time business
continuity and disaster recovery are the new desired end results. Disk eliminates the daily grind
and uncertainty that typically surround backup to tape: IT staffs get relief from worrying
whether backups and restores are completing successfully, or whether backup jobs have failed.
Disk by its very nature fixes all the intrinsic problems of tape:
- Volumes or NAS shares are virtual and can have a large number of backup jobs written in parallel, reducing the backup window.
- Backups and restores are reliable with disk. With tape, up to 15% of backups fail and up to 30% of restores fail. With disk, virtually 100% of backups complete and 100% of restores succeed.
- Disks are in a hermetically sealed case inside a temperature- and humidity-controlled data center, eliminating the environmental degradation issues of tape.
- Disks reside in a rack in a data center, which is in turn secured by physical and network security. Therefore, the security issues of tape moving around are eliminated.
- There are no handwritten labels that can fall off. The software automatically tags all jobs that have been written to disk.
- In addition to physical and network data center security, data stored on disk can be encrypted with only a 3-4% performance reduction. Encrypting while writing to tape dramatically slows down the backups.
- Restoring data from backups, including incremental backups, is fast from disk. No time is lost finding or retrieving tapes in transit or at remote locations. Not only is disk more reliable to restore from, but it is also random access, versus tape, which is sequential access.
Over the last decade, a range of technologies has emerged that makes it feasible for disk to
replace tape. Disk-based solutions now offer the benefits that once only tape offered, such as
effectively unlimited capacity, portability and manageability.
Make better use of resources
The use of disk storage for augmenting tape, or of disk storage and deduplication either
augmenting or eliminating tape, is becoming a more logical investment for organizations.
Scarce resources once used to deliver "just" data protection can be repurposed for the strategic
business initiatives of disaster recovery and business continuity.
Those responsible for planning or carrying out backups are looking for tape alternatives that
offer:
- Less IT staff time spent on backups, resulting in time to focus on other valuable IT initiatives
- Faster backups
- More reliable backups
- Faster and more reliable restores
- Ability to meet all financial, governmental and legal retention requirements
- All of the above without major changes to the current environment that could create work, risk or disruption
Considerations When Examining Disk-Based Backup Approaches
Now that it is economically feasible to move from a tape-based to disk-based backup approach,
a large number of vendors with varying approaches have emerged. This has caused a great
amount of confusion for IT managers looking to adopt a disk-based backup system for their
organization.
To help clear away this confusion, this white paper will first present a general overview of
several different deduplication approaches. This section will show how deduplication can store
far less data on a given amount of disk, using new technologies that minimize the amount of disk
required. This results in a cost for disk that is about the same as tape.
For reference, a chart is included that lists the backup requirements of each of these
approaches. Next, each of the six potential solutions that are often considered to replace tape
with disk will be presented in turn. Information about each approach will be shown, including the
pros and cons of each. These six approaches are as follows:
- Backup services or cloud backup services
- Disk staging – storing data on disk that has been inserted between the media servers and the tape libraries
- Primary storage SNAPs
- Backup application data deduplication in the media server, writing to standard disk
- Backup application data deduplication in server agents (client side), writing to standard disk
- Purpose-built target-side appliances with deduplication
Data Deduplication Overview
One of the few remaining arguments for tape is that tape libraries technically never "run out of
retention capacity". As soon as a tape cartridge fills up, it can be replaced with another tape
cartridge, and the full cartridges can be stored. When writing to disk, storing the same amount
of data that is stored on tape would require a massive amount of disk, resulting in high cost.
However, if you could use a fraction of the space required to store the data on disk and bring the
cost of disk storage close to the cost of tape, then disk is clearly the better alternative. From
week to week, only about 2% of the bytes change. However, with tape backup, 98% of the
unchanged data is backed up repeatedly, resulting in saving the identical data dozens and even
hundreds of times.
With disk, deduplication software can intelligently save only the 2% of the data that changes
from week to week. The net result of using disk storage and data deduplication together is that
you only need 1/20th to 1/50th of the storage you would need on tape.
Figure 1 - Data Deduplication Taxonomy
Since tape costs about 1/20th the price of disk per TB of usable capacity, using data
deduplication effectively neutralizes the price gap between tape and disk by using far less disk
space than is required to store the same data on tape. There are many methods of data
deduplication, including:
- Fixed data block (64KB to 128KB) - used in backup software applications
- Changed storage blocks - used in primary storage SNAPs
- Byte level - used in target-side appliances
- Data block with variable content splitting - used in target-side appliances
- Zone-byte level - used in target-side appliances
All of these methods reduce redundant data in backups. For example, if a full backup of 50TB of
data is completed every Friday night, and 10 weeks are kept onsite, it would take 500TB of disk
space to store the backups. However, most of the full backup is unchanged from week to week.
Only the data that has been changed, edited or created that week needs to be stored. On
average, only about 2% of the data changes from week to week. In this example, 2% is about
1TB per week.
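This arithmetic can be sketched as follows. The 50TB full backup, 10 weeks of retention and ~2% weekly change rate come from the example above; the simple linear change model is an illustrative assumption.

```python
# Storage needed with and without deduplication, per the example above.
full_backup_tb = 50      # size of one weekly full backup
weeks_retained = 10      # weeks of retention kept onsite
change_rate = 0.02       # ~2% of bytes change from week to week

# Without deduplication, every full backup is stored in its entirety.
raw_storage_tb = full_backup_tb * weeks_retained

# With deduplication, the first full is stored once; each later week adds
# only the ~2% of data that actually changed (a simplified linear model).
dedup_storage_tb = full_backup_tb + full_backup_tb * change_rate * (weeks_retained - 1)

ratio = raw_storage_tb / dedup_storage_tb

print(f"Raw storage:  {raw_storage_tb} TB")
print(f"Deduplicated: {dedup_storage_tb:.0f} TB")
print(f"Reduction:    {ratio:.1f}:1")
```

Note that the reduction ratio keeps growing with retention: each additional week adds 50TB to the raw total but only about 1TB to the deduplicated store, which is why longer retention periods yield the higher ratios discussed below.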
If you were to take out all of the redundant data, over time the storage required can be reduced
by as much as 50:1, depending on the deduplication method used.
Factors Impacting Deduplication Results
In general, the higher the deduplication ratio, the better. A higher deduplication ratio uses less
disk space over time and needs far less WAN bandwidth to replicate data to the offsite disaster
recovery site.
Deduplication Approach
The deduplication approach selected impacts the amount of storage savings that will result.
Figure 2 - Deduplication Reduces Storage over Time
- 64KB to 128KB fixed block will average about 7:1
- Byte, segment-block and zone methods will average from about 20:1 to 50:1 reduction in data storage
Data Mix Affects Results
The deduplication ratio can range from 10:1 to as much as 50:1, depending on the mix of data
types being backed up. Databases can get very high deduplication ratios of over 100:1.
Unstructured file data will see an average ratio of 7-10:1. Deduplicating compressed or
encrypted files does not yield a high ratio or significant space savings.
Retention Period
The longer the retention period, the higher the deduplication ratio will be.
Getting the Best Results
The best deduplication ratios will be achieved in environments that are:
- Using byte, data block or zone-level deduplication
- Backing up no compressed or encrypted data
- Retaining data for longer-term periods, on the order of 18 weeks
The worst deduplication ratios will be achieved in environments that are:
- Using 64KB or 128KB fixed block deduplication
- Backing up a large amount of compressed or encrypted data
- Retaining data for shorter-term periods, on the order of 4 weeks or less
The net result is that not all deduplication approaches achieve the same results. Deduplication
ratios are clearly impacted by data types and retention periods. All of these factors need to be
taken into consideration when choosing the proper disk backup approach.
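The way the data mix drags the overall ratio down can be sketched numerically. The per-type ratios below are the rough figures quoted above (databases over 100:1, unstructured files 7-10:1, compressed or encrypted data close to 1:1); the 60/30/10TB mix is a hypothetical example, and the effective ratio is total logical data divided by total stored data.

```python
# Effective deduplication ratio for a hypothetical mix of backup data.
data_mix_tb = {          # backup volume per data type, in TB (assumed mix)
    "databases":  60,
    "file_data":  30,
    "compressed": 10,
}
dedup_ratio = {          # rough per-type reduction ratios from the text
    "databases":  100.0,
    "file_data":  8.0,
    "compressed": 1.0,   # compressed/encrypted data barely deduplicates
}

logical_tb = sum(data_mix_tb.values())
stored_tb = sum(size / dedup_ratio[k] for k, size in data_mix_tb.items())
effective_ratio = logical_tb / stored_tb

print(f"Effective deduplication ratio: {effective_ratio:.1f}:1")
```

Even though 60% of this mix deduplicates at 100:1, the 10TB of compressed data dominates the stored total and pulls the overall result down to roughly 7:1, which is why a small amount of compressed or encrypted data can undercut an otherwise strong deduplication method.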
Backup Requirements
The chart below shows the top backup requirements of most IT shops, arranged in priority order. Each of the
approaches, including staying with tape, is shown in its own column. As you can see, not all approaches can meet all
requirements. The key is to list your requirements and match them against each of the solutions to see which best
meet your requirements. The following sections show the strengths and limitations of each of the six disk solutions.
Backup or Cloud Services
There are many backup or cloud services to which backup can be outsourced, and the market is
evolving as new players enter the field. These services require replacing the server agents used
by the backup application. The service can then remotely manage the backup environment.
At the start, one complete backup of the data needs to be sent to the backup server. The
logistics of doing this data transfer can be troublesome, due to the large, sustained bandwidth
required. After the initial full backup is transferred, just the changes in the data need to be
uploaded to the outsourced service. Most of these agents only move changed bytes once the
initial full backup is at the service provider (in the cloud).
Figure 3 - Typical Cloud Backup
Before a cloud backup recovery strategy is implemented, two key factors should be considered.
First, one should ask what the recovery point objective (RPO) is for the business service being
considered. Second, one should ask what the recovery time objective (RTO) is for the business
service.
Be sure to carefully evaluate the claims made in cloud service contracts. The most important of
these contractual promises are the availability of the service, the provider's service level
agreements (SLAs), and the security of your data.
According to a Yankee Group report,1 "cloud contracts are rife with disclaimers, misleading
uptime guarantees, and questionable privacy policies…"
Strengths
- Frees up IT staff to do other core/critical IT tasks
Weaknesses
- Requires changing all the server agents from your existing backup application to the outsourced service's backup agents. Any change to the agents will require weeks or months of tweaking.
1 http://www.yankeegroup.com/about_us/press_releases/2010-04-21.html
- Good for small amounts of data, typically under 1TB. Best fit for small IT shops or a large company's small remote office, but not for multi-TB environments. This limitation is due to the time needed to recover the data over the internet. Under normal operation, only the changed bytes or blocks get sent for replication. However, if a full backup restore is required, it would take about 31 days to get 1TB of data over 3Mb/s of internet bandwidth. It is key to note that what matters is not the bandwidth between your sites but rather your bandwidth to the internet.
- If the data is over a few TB, most service providers need to place a hardware appliance (cache) in the IT environment to keep at least one week of backups (including a full backup) onsite to overcome the recovery bottleneck presented by bandwidth to the internet. The cost of the cache appliance plus the monthly fees makes a backup or cloud service the most expensive backup choice if you have more than a few TBs of data to protect.
Summary
- For consumers, small IT environments (<1TB) and small remote offices with a small or nonexistent IT staff, a small data center (if any) and low bandwidth, a backup service is the best way to go.
- If there is a reasonable amount of data (>1TB), services become too cumbersome and too costly.
Disk Staging
Disk staging places disk between the media servers or storage nodes and the tape library. This
is also considered tape augmentation.
All backup applications can write directly to a disk volume or NAS share, so disk staging works
natively with all backup applications. Disk staging reduces the perceived backup window at the
client level, reduces the backup verification window at the server level, and provides high-speed
recovery of files from disk, rather than tape.
Figure 4 - Disk Staging Concept Overview
Strengths
- By placing disk between the media servers/storage nodes and the tape library, many problems are solved:
  - Multiple parallel jobs can be handled, without being limited by the number of physical tape drives. This results in faster backups, assuming the media servers can keep up.
  - Reliable backups and reliable restores are assured using disk.
Weaknesses
- Disk staging becomes expensive very quickly:
  - Disk staging does not eliminate the use of tape onsite or offsite. It simply augments tape onsite.
  - There is no data deduplication when using disk staging, so the amount of disk grows very quickly and becomes extremely expensive with any level of retention.
- For example, two weeks of nightly backups and weekly full backups require storing four times the size of the primary data on disk. This assumes a rotation of full backups for databases and email nightly, incremental backups of files nightly and full backups on Friday.
- Each night, a combination of incremental backups of files and full backups of databases and email will equal about 25% of a full backup. These Monday through Thursday nightly backups will add up roughly to the size of a full backup.
Using 40TB of data as an example, nightly backups after four nights will total 40TB and a
Friday full backup will be 40TB. Together, they will require a total of 80TB of disk storage.
After two weeks, this expands to 160TB of disk storage required. Therefore, 90% of
customers using disk staging keep between one and two weeks of data on disk.
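The staging arithmetic above can be sketched as follows. The 40TB primary data size, the ~25% nightly figure and the absence of deduplication all come from the example in the text.

```python
# Disk staging storage growth: no deduplication, so every backup is
# stored in full.
primary_tb = 40           # size of the primary dataset
nightly_fraction = 0.25   # Mon-Thu incrementals + DB/email fulls ~ 25% of a full
nights_per_week = 4       # Monday through Thursday

# One week = four nightly backups plus the Friday full backup.
weekly_tb = primary_tb * nightly_fraction * nights_per_week + primary_tb

for weeks in (1, 2, 4):
    print(f"{weeks} week(s) of retention: {weekly_tb * weeks:.0f} TB of disk")
```

The linear growth is the point: every extra week of retention adds another 80TB, which is why disk staging is rarely used for more than one to two weeks of retention.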
Summary
- Disk staging is good for one to two weeks of onsite retention on disk.
- It is estimated that about 70% of tape users use disk staging.
- For retention beyond one or two weeks, or for tape replacement onsite, an organization must use data deduplication in order to store only unique data (not the redundant data) and thus use far less disk, reducing the cost impact.
Primary Storage SNAPS
Primary storage SNAPS (a quick logical copy or snapshot) are useful primarily for short-term
retention. They are just the first line of defense in a layered backup scheme that includes long-
term backups. SNAPS save changed storage blocks on a periodic basis (e.g. hourly), allowing
roll back to the last period. Primary storage SNAPS are not intended for long-term or historical
backup.
Figure 5 - SNAPS Concept
Strengths
- SNAPS allow rolling back to earlier points and are more granular than a nightly backup
- SNAPS can be replicated offsite for disaster recovery of short-term, periodic SNAP points
Weaknesses
- SNAPs write into the same volume as the primary data, so they do not offer protection against a system crash, virus attack, data corruption or other event that destroys the primary data. The SNAPs would be destroyed along with the primary data. This is why 99% of IT environments keep a backup copy on a separate system onsite (tape or disk).
- SNAPs are not good for long-term retention uses such as legal discovery, regulatory compliance or SEC audits. When years of retention are required, a traditional backup approach is needed, due to the need to store data at specific points in time but not every interval in between, such as monthly backups for 3 years and then yearly backups for 4 additional years.
Summary
- Primary storage SNAPS and long-term traditional backup can co-exist as part of a multi-layered approach to backup, tailored to the specific requirements of the business.
- Primary storage SNAPS provide fine-granularity backup points onsite and also offsite, if replicated.
- It is estimated that 99% of IT environments use a traditional, longer-term backup system. About 50% of IT environments deploy some type of primary SNAPs as well.
Backup Application Deduplication in the Media Server
Some backup applications have a data deduplication feature that can be deployed as an agent in
the media server. The intent is to be able to eliminate tape using standard disk in conjunction
with the backup application.
Data deduplication is a very compute-intensive process. If deduplication is run in the media
server, resource utilization will increase significantly. This can slow backups down dramatically.
To avoid this hit to overall backup performance, backup software uses a form of deduplication
that results in a lower reduction rate. Using the least possible processor and memory resources
for the deduplication process avoids starving the media server tasks of resources, but at the
cost of lowering deduplication effectiveness.
Figure 6 - Running Deduplication on Media Server
Typically this approach uses 64KB or 128KB fixed blocks and will yield a data reduction ratio of
about 6-7:1. By comparison, target-side appliances that use byte, zone-byte or segment-block
deduplication with variable-length content splitting average from about 20:1 to as much as
50:1, at a minimum approximately three times the reduction of the software approach.
In addition, software deduplication can only process data that comes from its own proprietary
agents. It cannot deduplicate data from other sources, including other backup applications,
utilities or database dumps.
Some vendors bundle the media server software on a storage server that includes a CPU,
memory and disk. This does not change the deduplication rate or the single-application nature of
the solution.
Strengths
- Relatively simple to manage through the backup application
- Good for environments that have less than 3TB of data to back up, use a single backup application and do not plan to replicate to a second site for disaster recovery
Weaknesses
- Disk usage is high, as the deduplication ratio is only 6-7:1. Over time the disk space required grows sharply.
- Bandwidth needed to send backups to a second site is high, as the deduplication ratio is only 6-7:1. By comparison, target-side appliances that use byte, zone-byte or segment-block deduplication with variable-length content splitting average from about 20:1 to as much as 50:1, at a minimum approximately three times the reduction of the software approach.
- Cannot deduplicate data from:
  - Veeam, VizionCore
  - Lightspeed, SQL Safe, Redgate
  - Direct SQL dumps, direct Oracle RMAN dumps
  - Bridgehead for Meditech data
  - Direct UNIX TAR files
  - Other traditional backup applications
Summary
- Deduplication in the backup software is good for short-term retention and low amounts of data, in environments that are not heterogeneous and where offsite disaster recovery data is not required.
Backup Application Client Side Deduplication
Some industry backup applications offer a form of data deduplication in the application server
agents or clients. The intent is to be able to eliminate tape using standard disk along with the
backup application. The deduplication occurs at the backup agent/client on each application
server.
Data deduplication is a very compute-intensive process. Resource utilization will increase
significantly if deduplication is run in the application server (client side), slowing backups down
dramatically.
Figure 7 - Client Side Deduplication
To minimize this impact, client side deduplication software uses a less-efficient form of
deduplication. Typically it uses 64KB or 128KB fixed blocks, achieving a data reduction rate of
about 6-7:1. By comparison, target-side appliances that use byte, zone-byte or segment-block
deduplication with variable-length content splitting average from about 20:1 to as much as
50:1, at a minimum approximately three times the reduction of the software approach.
Running a compute-intensive deduplication process on your application servers creates other
performance and availability challenges. Furthermore, databases and email, which are 80% of
the Monday through Thursday backups, are still sent as full backups. This means that only 20%
of the nightly data is actually deduplicated by client side deduplication during the week. The
true impact is on the Friday night full backup, where 80% of the data is unstructured file data.
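Why the weekly benefit is so small can be sketched numerically. The 80/20 split between database/email fulls and deduplicated file data comes from the text; the 10TB nightly volume and the 7:1 file-data ratio are illustrative assumptions.

```python
# Effective nightly reduction from client side deduplication during the
# Monday-Thursday backups.
nightly_tb = 10.0        # hypothetical nightly backup volume
db_email_share = 0.80    # sent as full backups, effectively no deduplication
file_share = 0.20        # file data, the only portion actually deduplicated
file_dedup_ratio = 7.0   # typical fixed-block ratio from the text

sent_tb = nightly_tb * db_email_share + nightly_tb * file_share / file_dedup_ratio
effective = nightly_tb / sent_tb

print(f"Data sent nightly: {sent_tb:.2f} TB")
print(f"Effective nightly reduction: {effective:.2f}:1")
```

Under these assumptions the overall weeknight reduction is only about 1.2:1, despite the 7:1 ratio on the file data, because the 80% of undeduplicated database and email traffic dominates.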
In addition, the software approach to deduplication can only process data that comes from its
own proprietary agents. It cannot deduplicate data from other sources, including other backup
applications, utilities or database dumps.
Strengths
- Great fit for deduplicating data from small remote sites, then replicating it back to a corporate datacenter for backup.
- This approach can shorten the backup window, but only for the Friday full backup. During the week, backups of databases and email are still full backups.
Weaknesses
- Requires new agents on servers; added risk and cost of changing agents.
- Deduplication ratio is only 6-7:1, and the disk space required increases quickly.
- Bandwidth usage to a second site is high, as the deduplication ratio is only 6-7:1. By comparison, target-side appliances that use byte, zone-byte or segment-block deduplication with variable-length content splitting average from about 20:1 to 50:1, at a minimum three times that of software deduplication.
- Cannot deduplicate data from:
  - Veeam, VizionCore
  - Lightspeed, SQL Safe, Redgate
  - Direct SQL dumps, direct Oracle RMAN dumps
  - Bridgehead for Meditech data
  - Direct UNIX TAR files
  - Other traditional backup applications
Summary
- Very good for replicating remote site data back to a corporate datacenter
- Very few businesses actually use this approach, due to its risk to application servers and its weaknesses
Purpose-Built Target Side Deduplication Appliances
Target-side deduplication appliances are built specifically to replace the tape library in the
backup process onsite and, optionally, offsite. Because they are dedicated appliances, the
hardware and the deduplication methods used can be optimized for that single purpose. Future
disk space requirements to deal with data growth are drastically reduced because deduplication
ratios from 20:1 to as much as 50:1 can be achieved. Only the data that changes, about 2% of
the backup size, is replicated offsite, which requires far less bandwidth.
In addition, target-side appliances can process data from a variety of utilities and backup
applications.
Strengths
- No change to your backup environment. Use all backup applications, utilities and dumps you are currently using.
- Can take in data from:
  - Traditional backup applications
  - Veeam, VizionCore
  - Lightspeed, Redgate, SQLSafe
  - SQL dumps, Oracle RMAN dumps
  - Direct UNIX TAR files
  - Many other backup applications and utilities
Figure 8 - Target Side Deduplication Appliance
- 20:1 to as much as 50:1 deduplication ratios use less disk space and far less bandwidth for replication.
- Special features for:
  - Tracking data to the offsite disaster recovery site
  - Improving disaster recovery RPO (recovery point objective) and RTO (recovery time objective)
  - Purging data as the retention policy calls for aging out data
Weaknesses
- The backup window improves over using a tape library, but not by as much as with client side deduplication for the Friday night full backup
Summary
When evaluating different approaches to replacing tape with disk, take the time to ask the right
questions and understand the strengths and weaknesses of each alternative.
About ExaGrid
ExaGrid is the leader in cost-effective disk-based backup solutions. A highly scalable system that
works with existing backup applications, the ExaGrid system is ideal for companies looking to
quickly eliminate the hassles of tape backup while reducing their existing backup windows.
ExaGrid’s innovative approach minimizes the amount of data to be stored by providing standard
data compression for the most recent backups along with zone-level data deduplication
technology for all previous backups. Customers can deploy ExaGrid at primary and secondary
sites to supplement or eliminate offsite tapes with live data repositories or for disaster recovery.
With offices and distribution worldwide, ExaGrid has more than 3,500 systems installed and
hundreds of published customer success stories and testimonial videos available at
www.exagrid.com.