Data Footprint Reduction: Understanding IBM Storage Options

sSE20
Data Footprint Reduction:
Understanding IBM Storage
Efficiency Options
Tony Pearson
Master Inventor and Senior Managing Consultant, IBM Corp

Sanjay S Bhikot
Advisory Unix and Storage Administrator, Ricoh Americas Corp

#IBMEDGE © 2012 IBM Corporation

Data Footprint Reduction is the catch-all term for a variety of technologies designed to help reduce storage costs. This session will cover thin provisioning, space-efficient copies, deduplication and compression technologies, and describe the IBM storage products that provide these capabilities.


Sessions -- Tony Pearson
• Monday
– 1:00pm Storing Archive Data for Compliance Challenges
– 4:15pm IBM Watson: What it Means for Society
• Tuesday
– 4:15pm Using Social Media: Birds of a Feather (BOF)
• Wednesday
– 9:00am Data Footprint Reduction: IBM Storage options
– 2:30pm IBM's Storage Strategy in the Smarter Computing era
– 4:15pm IBM SONAS and the Cloud Storage Taxonomy
• Thursday
– 9:00am IBM Watson: What it Means for Society
– 10:30am Tivoli Storage Productivity Center Overview
– 5:30pm IBM Edge “Free for All” hosted by Scott Drummond


Agenda

• Thin Provisioning
• Space-Efficient Copy
• Data Deduplication
• Compression


History of Thin Provisioning

• 1994: The StorageTek Iceberg 9200 Array introduced Thin Provisioning on slower 7200RPM drives for mainframe systems
• 1997: IBM resold this as the RAMAC Virtual Array (RVA) for mainframe servers
• Today: Thin Provisioning is available for many operating systems on IBM storage, including DS8000, XIV, SVC, N series, Storwize V7000, DS3500 and DCS3700

Why Space is Over-Allocated
• Scenario 1
– Space requirements under-estimated
– Running out of space requires larger volume
– New request may take weeks to accommodate
• Application outage if not addressed in time
– Data must be moved to the larger volume
• Application outage during data movement
• Scenario 2
– Space requirements over-estimated
– Capacity lasts for years
• No data migration
• No application outages
• No penalties

When faced with this dilemma, most will err on the side of over-estimating

Fully Allocated vs. Thin Provisioned
• Fully allocated
– Host sees fully allocated amount
– Actual data written
– Allocated but unused space dedicated to this host, wasted until written to
• Thin provisioned
– Host sees full virtual amount
– Physical space allocated covers only the actual data written
– Empty space available to others

Fully Allocated vs. Thin Provisioned

Host sees a volume or LUN that consists of blocks numbered 0 to nnnnnnnnnn

• Volume/LUN – one or more extents
• Extent – Allocation Unit, one or more grains
• Grain – range of 1 or more blocks
• Block – typically 512 or 4096 bytes

Coarse and Fine-Grain

Blocks 00, 55, and 99 written

• Fully Allocated: all 10 extents allocated
• Coarse-Grain: only 3 extents allocated (grain = extent, for example grain 90-99)
• Fine-Grain: only 1 extent allocated (grains 00-01, 54-55 and 98-99)

[Figure: three 10 x 10 block grids labeled Fully Allocated, Coarse-Grain and Fine-Grain]
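
The counts on this slide can be reproduced with a few lines of Python. This is only a sketch: the function name and the assumption of 100 blocks in 10 extents of 10 blocks each are read off the figure, and the packing of fine grains into a shared extent is a simplification, not a description of any IBM product.

```python
import math

def extents_allocated(written_blocks, total_blocks, extent_blocks, grain_blocks=None):
    """Return how many extents a volume consumes.

    grain_blocks=None models a fully allocated volume. Otherwise space is
    consumed one grain at a time and the touched grains are packed into extents.
    """
    if grain_blocks is None:
        return math.ceil(total_blocks / extent_blocks)
    # A grain is consumed as soon as any block inside it is written.
    grains_touched = {block // grain_blocks for block in written_blocks}
    blocks_consumed = len(grains_touched) * grain_blocks
    return math.ceil(blocks_consumed / extent_blocks)

written = [0, 55, 99]  # blocks 00, 55 and 99 from the slide
fully = extents_allocated(written, total_blocks=100, extent_blocks=10)
coarse = extents_allocated(written, 100, 10, grain_blocks=10)  # grain = extent
fine = extents_allocated(written, 100, 10, grain_blocks=2)     # 2-block grains
print(fully, coarse, fine)  # 10 3 1
```

The fine-grain case touches grains 00-01, 54-55 and 98-99, six blocks in total, which fit inside a single 10-block extent.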

How IBM has implemented TP

                  IBM DS8000   IBM XIV   SVC and Storwize V7000   DS3500, DCS3700
Type              Coarse       Fine      Fine                     Fine
Allocation Unit   1 GB         17 GB     16 MB to 8 GB            4 GB
Grain size        -            1 MB      32-256 KB                64 KB

Thick-to-Thin Migration

Volume mirror:
• Copy 0: Fully-allocated volume
• Copy 1: Thin-provisioned volume

Only non-zero blocks copied

Empty Space Reclaim

• Thin Provisioning: allocations in 17GB units, with 1MB chunks (grains). Only non-zero blocks consume physical space.

• Avoid writing empty blocks: any I/O request that tries to write a block of all zeros to unallocated space is ignored.

• Background task to find empty chunks: a background task scans all blocks, looking for chunks containing all zeros.

• Empty space reclaimed: empty chunks are returned to unallocated space, so that they can be used for other volumes.
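
A minimal sketch of the two behaviors described on this slide, zero-write avoidance and background reclaim. The class name `ThinVolume`, its methods and the 4-byte grain are hypothetical and chosen only to keep the example readable; they do not describe the actual array firmware.

```python
GRAIN_SIZE = 4  # bytes per grain; tiny on purpose (the slide uses 1MB chunks)

class ThinVolume:
    def __init__(self):
        self.grains = {}  # grain index -> bytes; only allocated grains appear here

    def write(self, grain_index, data):
        # A write of all zeros to unallocated space is ignored.
        if grain_index not in self.grains and not any(data):
            return
        self.grains[grain_index] = bytes(data)

    def read(self, grain_index):
        # Unallocated grains read back as zeros.
        return self.grains.get(grain_index, bytes(GRAIN_SIZE))

    def reclaim(self):
        """Background scan: free allocated grains that contain only zeros."""
        empty = [index for index, data in self.grains.items() if not any(data)]
        for index in empty:
            del self.grains[index]
        return len(empty)

vol = ThinVolume()
vol.write(0, bytes(GRAIN_SIZE))  # zeros to unallocated space: nothing allocated
vol.write(1, b"data")
vol.write(2, b"temp")
vol.write(2, bytes(GRAIN_SIZE))  # grain 2 is already allocated, so zeros are stored
freed = vol.reclaim()            # the scan finds grain 2 and returns it to the pool
```

After the reclaim only grain 1 still consumes physical space, while reads of grains 0 and 2 continue to return zeros.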



Thin Provisioning

Pros
• Just-in-Time increased utilization percentage
• Eliminates the pressure to make accurate space estimates
• Dynamically expand volume without impacting applications or rebooting server
• Reduces the data footprint and lowers costs
• Shifts focus from volumes to storage pool capacity

Cons
• Not all file systems are cooperative or friendly
– Deletion of files does not free space for others
– “sdelete” writes zeros over deleted file space
• Some implementations may impact I/O performance
• May not support same set of features, copy services, or replication
• “Writing checks you can’t cash”

History of Space-Efficient Copies

• 1993: NetApp introduces Snapshot in its WAFL file system
• 1997: IBM Enterprise Storage Server (ESS) introduces NOCOPY parameter on FlashCopy
• Today: Space-Efficient Copy is available on many IBM storage systems, including DS8000, XIV, SVC, N series, Storwize V7000, DS3500, DS5000 and DCS3700

Space-Efficient Copies

• Source: 100 GB allocated, 40 GB written
• Traditional Copies: Destination 1, Destination 2 and Destination 3 total 300 GB
• Space-Efficient Copies, 10% reserved: 30 GB

Method 1: Copy on Write (COW)
• Copy-On-Write (COW)
– Copy is set of pointers to original data
– Write to original volume:
• Pause I/O
• Copy original block of data to destination
• Update original block
– Slows performance
– May limit # of destination copies
– Can be combined with background copy for a full copy

Before the write: Source blocks A B C D, Destination holds only pointers
After the write: Source blocks A B C2 D, Destination holds C
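
The write sequence above can be sketched in Python as follows. The names `CowSnapshot` and `cow_write` are hypothetical, and the "Pause I/O" step is implied by the function running to completion before returning; this is an illustration of the general technique, not any product's FlashCopy code.

```python
class CowSnapshot:
    """Point-in-time copy that starts out as pointers to the source volume."""

    def __init__(self, source):
        self.source = source  # shared reference to the live source blocks
        self.saved = {}       # block index -> original contents, filled on demand

    def read(self, index):
        # Preserved blocks come from the destination, the rest from the source.
        return self.saved.get(index, self.source[index])

def cow_write(source, snapshots, index, data):
    # Copy the original block to every destination that has not saved it yet...
    for snap in snapshots:
        if index not in snap.saved:
            snap.saved[index] = source[index]
    # ...and only then update the original block in place.
    source[index] = data

source = ["A", "B", "C", "D"]
snap = CowSnapshot(source)
cow_write(source, [snap], 2, "C2")
snapshot_view = [snap.read(i) for i in range(len(source))]
```

The extra copy inside `cow_write` is the reason each first write to a block costs more I/O, and the loop over `snapshots` hints at why the number of destination copies may be limited.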

Method 2: Redirect on Write (ROW)

• Redirect-On-Write (ROW)
– Copy is set of pointers to original data
– Write to original volume:
• Re-directed to new empty space
• Previous data left alone
– Does not impact performance
– Supports many destination copies

Before the write: Source blocks A B C D, Destination holds only pointers
After the write: Source blocks A B C D plus C2 in new space, Destination unchanged
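
A comparable sketch for redirect-on-write, again with hypothetical names (`RowVolume`, `snapshot`, `pointer_map`). The physical store is modeled as an append-only list so that previous data is never touched.

```python
class RowVolume:
    def __init__(self, blocks):
        self.store = list(blocks)             # physical space, append-only
        self.live = list(range(len(blocks)))  # logical block -> physical slot

    def snapshot(self):
        # A snapshot is just a copy of the pointers, not of the data.
        return list(self.live)

    def write(self, index, data):
        # The new data is redirected to empty space; the old block is left alone.
        self.store.append(data)
        self.live[index] = len(self.store) - 1

    def read(self, index, pointer_map=None):
        pointer_map = self.live if pointer_map is None else pointer_map
        return self.store[pointer_map[index]]

vol = RowVolume(["A", "B", "C", "D"])
snap = vol.snapshot()
vol.write(2, "C2")
live_view = [vol.read(i) for i in range(4)]
snap_view = [vol.read(i, snap) for i in range(4)]
```

A write costs one append and one pointer update no matter how many snapshots exist, which matches the slide's points about performance and many destination copies.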


Space-Efficient Copies

Pros
• Supports both Fully-allocated and Thin-Provisioned Sources
• Reduces the data footprint and lowers costs
• Allows you to keep more copies online
• Allows you to take copies more frequently
• Can be used as checkpoint copies during batch processing

Cons
• Some implementations may impact I/O performance
• Requires that you estimate the maximum percentage changed
– Typically 10-20 %
• Exceeding the reserved space invalidates destination copy

History of Data Deduplication

• 2007: Advanced Single Instance Store (A-SIS) brings deduplication to the IBM N series and NetApp disk storage
• 2008: IBM acquires Diligent and introduces the ProtecTIER TS7600 virtual tape library with data deduplication
• Today: IBM offers a variety of choices, including ProtecTIER, N series, and Tivoli Storage Manager (TSM v6)

Data Deduplication

• Data deduplication reduces capacity requirements by
only storing one unique instance of the data on disk
and creating pointers for duplicate data elements


Deduplication reduces disk
required for backup copies


Two Primary Data Deduplication Approaches

• Hash based Deduplication
– Sometimes referred to as a Content Addressable Storage approach
• HyperFactor
– A different approach based on an agnostic view of data

Hash-Based Approach

1. Slice data into chunks (fixed or variable)

A B C D E

2. Generate Hash per chunk and save
Ah Bh Ch Dh Eh

3. Slice next data into chunks and look for Hash Match

A B C D E

4. Reference data previously stored
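
The four steps can be sketched with fixed-size chunks and SHA-1, whose 20-byte digest is a common choice for this kind of index. The function names, the in-memory `index` dictionary and the 4-byte chunk size are assumptions for illustration; hash collisions are deliberately not handled here.

```python
import hashlib

def deduplicate(data, chunk_size, index, store):
    """Store only unseen chunks and return the list of chunk references."""
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]  # step 1: slice into chunks
        digest = hashlib.sha1(chunk).digest()     # step 2: hash per chunk
        if digest not in index:                   # step 3: look for a hash match
            index[digest] = len(store)
            store.append(chunk)
        recipe.append(index[digest])              # step 4: reference stored data
    return recipe

def rehydrate(recipe, store):
    """Rebuild the original data from its chunk references."""
    return b"".join(store[i] for i in recipe)

index, store = {}, []
first = deduplicate(b"AAAABBBBCCCCDDDDEEEE", 4, index, store)
second = deduplicate(b"AAAABBBBXXXXDDDDEEEE", 4, index, store)
```

The second stream adds only one new chunk to the store; its other four chunks are references to data saved by the first stream.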

HyperFactor Approach

1. Look through data for similarity

New Data Stream

2. Read elements that are most similar
3. Diff reference with version – will use several elements

Element A Element B Element C

4. Matches factored out – unique data added to repository


Assessment of Hash-based Approaches

Example: Imagine a chunk size of 8 KB
• 1 TB repository has ~125,000,000 8 KB chunks
• Each hash is 20 bytes long
• Need pointers scheme to reference 1 TB
The hash-table requires 2.5 GB RAM
» no issue
With a 100 TB repository
» ~250 GB of RAM is required

• Applicable for all chunking methods
• Hash Table in Memory
– Overhead for in-band deduplication
– Hash table will grow with data volume
– Growing hash-table may become performance bottleneck
– Scalability issues
• Hash-Collisions must be handled
• Hash table must be protected
– One copy might not be sufficient
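
The arithmetic behind these figures, as a short sketch. It assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes), which is what makes the slide's round numbers come out exactly, and it counts only the 20-byte hashes, not the pointer scheme.

```python
TB = 10**12
GB = 10**9

def hash_table_size(repository_bytes, chunk_bytes=8_000, hash_bytes=20):
    """Return (number of chunks, bytes of RAM needed for their hashes)."""
    chunks = repository_bytes // chunk_bytes
    return chunks, chunks * hash_bytes

chunks_1tb, ram_1tb = hash_table_size(1 * TB)
chunks_100tb, ram_100tb = hash_table_size(100 * TB)
print(chunks_1tb, ram_1tb / GB)   # 125000000 2.5
print(ram_100tb / GB)             # 250.0
```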


When Deduplication Occurs
1. In-line Processing
– As data is received by the target device it is
• Deduplicated in real time
• Only unique data stored on disk
– Data written to the disk storage is deduplicated

2. Post-Processing
– As data is received by the target device it is
• Temporarily stored on disk storage
– Data is subsequently read back in to be processed by a
deduplication engine


Comparison of Offerings

                  Hash-based                                   HyperFactor
In-line Process   Other vendors                                IBM ProtecTIER: TS7680G, TS7650G, TS7650, TS7620 Express, TS7610 Express
Post-Process      IBM Tivoli Storage Manager (TSM), N series   -

IBM ProtecTIER with HyperFactor

• Gateways
– Attaches up to 1PB of disk
– Two models:
• TS7680 for IBM System z
• TS7650G for distributed systems

• Appliances
– Disk included inside
– Three models for distributed
systems
• TS7650 … in three sizes
• TS7620 (New!)
• TS7610 ... in two sizes

ProtecTIER vs. Tivoli Storage Manager

Both Solutions Offer the Benefits of Target side Deduplication:
– Greatly reduced storage capacity requirements
– Lower operational costs, energy usage and TCO
– Faster recoveries with more data on disk

Complementary Solutions Today! Can be used together but don’t deduplicate the same data twice

Use ProtecTIER (IBM TS7600) When:
– Highest performance and capacity scaling are required!
– Up to 1400 MB/sec (2.5GB/s with 2 node) deduplication rates are needed
– Deduplicated capacities up to 25 PB are required
– You wish to avoid operational impact of post processing deduplication
– A VTL appliance model is desired
– Deduplicating across multiple TSM (or other backup) servers

Use TSM 6 Built-in Deduplication When:
– You desire deduplication operations be completely integrated within TSM
– The benefits of deduplication are desired without separate hardware or software dependencies or licenses (ships with TSM Extended Edition)
– You desire end to end data lifecycle management with minimized data store

Data Deduplication

Pros
• Designed for backups
• Can offer up to 25x data footprint reduction
– Allows disk backup repositories to approach cost of tape-based solutions
• Allows more backup copies to remain on disk for faster restores
• Available with a variety of interfaces, including VTL, OST and NAS

Cons
• Dealing with Hash Collisions
– May require byte-for-byte comparisons or keeping secondary copy of data
• Some systems do not scale
• Some systems have slow restores
– Re-hydrating data back to normal
• Primary data may not dedupe very well
– Your mileage may vary!

History of Compression

• 1973: NASA and IBM developed the Houston Aerospace Spooling Protocol (HASP) with compression for long distance data transmission.
• 1986: IBM introduced the Improved Data Recording Capability (IDRC) for the 3480 tape drive
• Today: IBM offers real-time compression for file and block level access to disk storage

Lossy vs. Lossless Methods

• Lossy
– Compress, then Decompress does not return data back to its original contents. Good enough?
– Used with music, photos, video, medical images, scanned documents, fax machines
• Lossless
– Compress, then Decompress returns data back to its original contents. Exactly the same
– Used with databases, emails, spreadsheets, office documents, source code

How Compression Works

• Lempel-Ziv lossless compression builds a dictionary of repeated
phrases, sequences of two or more characters that can be
represented with a smaller number of bits
• In the above excerpt from “Lord of the Rings”, all of the red text
represents repeated sequences eligible for compression!

Source: The Lempel Ziv Algorithm, Christian Zeeh, 2003
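
A minimal sketch of the dictionary idea, using the LZ78 variant of Lempel-Ziv because it fits in a few lines; the slide does not say which variant is meant, so treat that choice and the function names as assumptions. Each output pair is (index of a phrase already in the dictionary, next character).

```python
def lz78_compress(text):
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty phrase
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending a phrase that was seen before
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Flush a trailing phrase that ended exactly on a dictionary entry.
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

def lz78_decompress(pairs):
    phrases = [""]
    pieces = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)

text = "abababab"
pairs = lz78_compress(text)
restored = lz78_decompress(pairs)
```

The longer and more repetitive the input, the longer the phrases each pair can stand for, which is where the saving comes from.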

Compressed Volumes
• Fully allocated: allocated but unused space dedicated to this host, wasted until written to
• Thin provisioned: host sees full virtual amount, physical space allocated matches actual data written
• Compressed: host sees full virtual amount, physical space allocated is up to 80% reduction from actual data written

Real-time Compression!
• Real-time Compression for primary data
– Less data stored on primary storage (up to 80%)
– No changes to applications or procedures
• Before it gets to the storage array
– Larger effective storage cache
– Disk Array can serve more requests from its read / write cache
– Lower storage CPU overhead
• Does not cause performance degradation
– Much smaller I/O / lower disk workload
– Reads/Writes are faster due to storage array’s response from cache instead of disk
– Additionally reads may come from advanced read ahead cache (no write cache)

[Figure: Workstations, IP Network, Application Servers, Cache, Disk Array]

FIVO vs. VIFO

• Fixed Input, Variable Output
– WAN transmission
– Sequential tape
– IBM Tivoli Storage Manager
– zip, tar, etc.
• Variable Input, Fixed Output
– Random Access Compression Engine™ (RACE)
– IBM Real-Time Compression Appliances
– IBM SVC, Storwize V7000
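
The difference between the two schemes can be illustrated with Python's `zlib`. This is a sketch of the general idea only: the function names, the 64-byte growth step and the greedy sizing loop are assumptions, and the source does not describe how RACE actually sizes its inputs.

```python
import zlib

def fivo(data, input_size):
    """Fixed Input, Variable Output: equal-sized inputs, outputs of any size."""
    return [zlib.compress(data[i:i + input_size])
            for i in range(0, len(data), input_size)]

def vifo(data, output_limit, step=64):
    """Variable Input, Fixed Output: grow each input while its compressed
    form still fits within output_limit bytes."""
    blocks = []
    start = 0
    while start < len(data):
        end = min(start + step, len(data))
        best = zlib.compress(data[start:end])
        while end < len(data):
            candidate_end = min(end + step, len(data))
            candidate = zlib.compress(data[start:candidate_end])
            if len(candidate) > output_limit:
                break  # one more step would overflow the fixed output block
            end, best = candidate_end, candidate
        blocks.append((end - start, best))  # (input bytes consumed, output)
        start = end
    return blocks

data = b"".join(b"record %04d;" % i for i in range(500))
fivo_blocks = fivo(data, 512)
vifo_blocks = vifo(data, output_limit=128)
```

With fixed-size outputs, a compressed block can be located by position alone, which is what makes random access practical.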


Compression for Disk data

Traditional Approaches (compression after modification)
• Extra work to ‘edit’ a file
• All blocks shift
– Only one common block (this example)
– Negative impact to deduplication
• No notion of data location

Real-time Compression (compression after modification)
• Small amount of work / I/O to edit
• Only modified block changes
– Multiple common blocks
– Enhances deduplication
• Data location via map

[Figure: a file with blocks ABC DEF GHI is edited to ABC DMN F GHI. With traditional compression the new compressed file is ABC DMN FGH I and the blocks shift. With Real-time Compression the new compressed file is ABC DEF1 GHI MN and the unmodified blocks stay identical.]

Compression Without Compromise
Expected Compression Ratios

Databases                                     Up to 80%
Server Virtualization: Linux virtual OSes     Up to 70%
Server Virtualization: Windows virtual OSes   Up to 55%
Collaboration: Office 2003                    Up to 75%
Collaboration: Office 2007 or later           Up to 25%
CAD/CAM Engineering/Design                    Up to 75%

Objectives:
• Run over a block device
• Estimate:
– Portion of non-zero blocks in the volume.
– Compression rate of non-zero blocks with RTC.
Performance:
• Runs FAST! < 60 seconds, no matter what the volume size
– Typical running time on a machine with multiple disks: < 20 seconds
• Give guarantees on the estimation: ~5% max error guarantee
– Can improve guarantee with more running time

Method:
• Random sampling and compression throughout the volume
• Collect enough non-zero samples to gain desired confidence
– Volumes with more zero blocks run slower (it takes more time to find non-zero blocks)
• Mathematical analysis gives confidence guarantees

• Note: we are estimating compression during migration of a volume into RTC (data at rest)
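
The sampling method can be sketched as follows. Everything here is an assumption made for illustration: the function name, the parameters, the use of `zlib` as a stand-in for the RTC compressor, and the simple stopping rule. The mathematical analysis behind the ~5% error guarantee is not reproduced.

```python
import random
import zlib

def estimate_savings(volume, block_size=4096, wanted_nonzero=100,
                     max_draws=10_000, seed=0):
    """Return (estimated fraction of non-zero blocks, estimated space saving)."""
    total_blocks = len(volume) // block_size
    if total_blocks == 0:
        return 0.0, 0.0
    rng = random.Random(seed)
    zero_block = bytes(block_size)
    draws = nonzero = raw = packed = 0
    # Keep sampling until enough non-zero blocks have been collected.
    while nonzero < wanted_nonzero and draws < max_draws:
        index = rng.randrange(total_blocks)
        block = volume[index * block_size:(index + 1) * block_size]
        draws += 1
        if block == zero_block:
            continue  # zero blocks cost a draw but add no compression sample
        nonzero += 1
        raw += len(block)
        packed += len(zlib.compress(block))
    nonzero_fraction = nonzero / draws
    savings = 1 - packed / raw if raw else 0.0
    return nonzero_fraction, savings

block = b"abcd" * 1024                # one highly compressible 4096-byte block
volume = (block + bytes(4096)) * 32   # alternating data blocks and zero blocks
fraction, savings = estimate_savings(volume)
```

The number of draws depends on how many non-zero samples are wanted rather than on the size of the volume, which is consistent with the running-time claim above.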


IBM Real-Time Compression
• For NAS devices
– IBM Real-Time Compression Appliance
– STN 6500
– STN 6800
• For Block devices
– SAN Volume Controller
– Storwize V7000


Migrating to Compressed Disk

Volume mirror:
• Copy 0: Fully-allocated or Thin-provisioned volume
• Copy 1: Compressed volume

Only non-zero blocks copied

Data Compression

Pros
• Can be used for data transmission, tape and disk data
• Can offer up to 80% data footprint reduction
• Available as front-end appliance or integrated into storage system
• Can be “Dedupe-Friendly”

Cons
• Some implementations are post-process
– Stores uncompressed data first, compress later
• Some implementations impact performance and/or consume substantial CPU resources
• Benefits vary by data type, and whether applications do their own compression or encryption
– Your mileage may vary

Thank You!

Session: sSE20
Presenters: Tony Pearson,
Sanjay Bhikot

#IBMEDGE
Intel, the Intel logo, Xeon and Xeon Inside are trademarks or registered
trademarks of Intel Corporation in the U.S. and /or other countries.

Additional Resources

Email:
tpearson@us.ibm.com

Twitter:
http://twitter.com/az990tony

Blog:
http://ibm.co/brAeZ0

Books:
http://www.lulu.com/spotlight/990_tony

IBM Expert Network:
http://www.slideshare.net/az990tony


Trademarks and disclaimers
© IBM Corporation 2012. All rights reserved.

Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other
countries. IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency which is now part of the Office of Government
Commerce. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or
registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States,
other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. ITIL
is a registered trademark, and a registered community trademark of The Minister for the Cabinet Office, and is registered in the U.S. Patent and Trademark Office. UNIX is a
registered trademark of The Open Group in the United States and other countries. Java and all Java-based trademarks and logos are trademarks or registered trademarks of
Oracle and/or its affiliates. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license
therefrom. Linear Tape-Open, LTO, the LTO Logo, Ultrium, and the Ultrium logo are trademarks of HP, IBM Corp. and Quantum in the U.S. and other countries.

Other product and service names might be trademarks of IBM or other companies. Trademarks of International Business Machines Corporation in the United States, other
countries, or both can be found on the World Wide Web at http://www.ibm.com/legal/copytrade.shtml.

Information is provided "AS IS" without warranty of any kind.

The customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual
environmental costs and performance characteristics may vary by customer.

Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not
constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor
announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related
to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance,
function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to
communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will
experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and
the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated
here.

Prices are suggested U.S. list prices and are subject to change without notice. Starting price may not include a hard drive, operating system or other features. Contact your
IBM representative or Business Partner for the most current pricing in your geography.

Photographs shown may be engineering prototypes. Changes may be incorporated in production models.

References in this document to IBM products or services do not imply that IBM intends to make them available in every country.
