2. 1
Disclaimer
NÝHERJI ACCEPTS NO LIABILITY FOR THE CONTENT
OF THIS PRESENTATION, OR THE CONSEQUENCES OF
ANY ACTIONS TAKEN ON THE BASIS OF THE
INFORMATION PROVIDED, UNLESS THAT INFORMATION
IS SUBSEQUENTLY CONFIRMED IN WRITING.
ANY VIEWS OR OPINIONS PRESENTED IN THIS SESSION
ARE SOLELY THOSE OF THE AUTHOR AND DO NOT
NECESSARILY REPRESENT THOSE OF IBM OR NÝHERJI.
3. 2
About Nyherji and Peter
Nyherji is one of Iceland's leading
service providers in the field of
information technology, offering
complete solutions including
consultancy, the provision of
hardware and software, office
equipment and technical service.
Pétur Eyþórsson has been the lead
designer of TSM and DR planning
infrastructure for all of Nyherji's
TSM customers for the last 14
years, and is an amateur folk-style
wrestler.
4. 3
Our Environment
Nyherji manages roughly 50 TSM Servers
TSM servers come in many shapes protecting 5 – 5,000 TB.
Main OS Windows, AIX and Linux.
TSM Server versions mostly 6.3
Mostly midrange customers that have historically used the
traditional disk-to-tape approach.
No VTLs, XIV or other high-end devices exist
Wide distribution of Storwize V7000 and V3700
5. 4
Businesses are storing unnecessary data
Businesses are spending 20% more than they need to on
backing up unnecessary data.
– “The most common mistake businesses make is to fail to
update their backup policies. It is not unusual for companies to
be using backup policies that are years or even decades old,
which do not discriminate between business-critical files
and the personal music files of employees.” ~Gartner
An especially notorious problem in Iceland's financial institutions
for historical reasons.
6. 5
Our History
TSM has made some improvements that offer some new
approaches
– TSM 6.1 (New TSM Database DB2, Introduced target
deduplication)
– TSM 6.2 (Introduced Source [client side] Deduplication)
– TSM 6.3 (Introduced Node Replication, FCM 3.1, and TDP for
VMware)
– IBM acquired FastBack
– TSM 6.4 (Enhancements to Node Replication, TSM Server
scalability, FCM 3.2 introduced support for NetApp devices as well
as Metro Mirror/Global Mirror)
Our past experience was solely based on conventional TSM disk-to-
tape servers
These new technologies offered new options that show great
potential.
7. 6
Our History
In November 2010, we decided to move 2 of our TSM servers
to a highly deduplicated environment.
We had no prior experience with TSM deduplication and not
much experience existed on the market that we could tap into.
Since then we have moved 2 other big environments to
deduplication and FCM
Adjustments have been made along the way, based on prior
experience.
The purpose of this presentation is to show you how we use our
TSM environments
8. 7
Our Environment
IBM Tivoli Storage Manager Suite for Unified
Recovery license
– Changes everything
– No more PVU Counting
– Incentive to push for technologies like FCM, deduplication and
compression.
• New challenges
6 high-performance environments (3 sites) emerged
– Use Deduplication where possible
– Flash Copy Manager
– TSM Node Replication
– Block Level backup where possible
– High utilization of Client Compression
9. 8
Design Goals
RTO Goal on Important Data less than 1 Hour
TSM Server RTO less than 6 Hours
Has to be Cost effective!
– NL-SATA for Storage Pool
– TSM Deduplication
13. 12
FCM VMware backup
Daily Inc, Weekly Full
– 2 week cycle
2 Device Classes (FCM)
– STANDARD
– INCREMENTAL
TDP for VMware used for weekly backup, 90 day retention
– Daily on Linux FCM Management Machine
Benefits of FCM for VMware
– MUCH Faster Restore Speed (Data Stores)
– No CBT issues
– Cheap license (Storwize)
14. 13
FCM Naming Conventions
One FCM DeviceClass
for each backup type
– Full, Incr, Copy
TARGET_NAMING
must specify a valid
target naming schema
– Difficult/Impossible
to manage if not
structured properly
Schedules back up
whole ESX clusters
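A device-class section in the FCM profile might look roughly like this. Illustrative only: apart from TARGET_NAMING and the STANDARD device class (both named on the slides), the parameter names and the naming pattern are assumptions to be checked against the FCM documentation for your hardware level:

```
>>> DEVICE_CLASS STANDARD
COPYSERVICES_HARDWARE_TYPE SVC
FLASHCOPY_TYPE COPY
TARGET_NAMING %SOURCE_%TARGETSET
<<<
```

A structured, predictable naming schema like this is what makes a full/incremental/copy trio of device classes manageable at ESX-cluster scale.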
15. 14
Our TSM Dedup High Availability Solution
[Diagram: Primary site and secondary site. Active data (excluding
Oracle) flows to the secondary site via Node Replication; Oracle on
the primary is protected with Metro Mirror.]
16. 15
Why FCM for Oracle
Restore/Backup time reduced down to minutes.
Added workload of client deduplication & compression not
acceptable for the DBAs on production machines.
– Auxiliary/Proxy machine to backup to TSM from FCM copies
does the Deduplication & Compression and sends to TSM
TDP for Oracle does not distinguish between Active/Inactive
copies.
– DR Problem when using Active Data Node Replication
• Solved with FCM Metro Mirror copy
17. 16
Our Environment
Total storage of 95TB, weekly change
of 25TB (before backup/dedup)
Intel-based servers
– 120-140 GB RAM
– 8-12 cores
– TSM DB, 6TB total available, on either
• 8 SSD (RAID-5) with EasyTier over
22 SAS and 1,7 TB SSD, or
• 48 SAS 15k rpm (RAID-10)
V7000 Controller
– DS3700 & DS3200
– 3TB NL-SAS drives
– 170TB usable storage (RAID-6)
FCM for higher RTO needs
Node Replication for high availability
Tape for
– long-term storage
– data that does not fit in dedup
storage
18. 17
TSM 24 hour work schedule
FCM Backups 18:00
Main Client Backups 18:00-02:00
Expire Inventory 02:00-03:30
Identify Duplicates 03:30-04:15
TSM DB backup 04:30-06:00
TSM File Data Node Replication 06:00-10:00
TSM Virtual Node replication 10:00 – 16:00
TSM Database Node Replication 14:00-18:00
TSM * Replication to capture missed and new data
TSM Space Reclamation 13:00 – 18:00 (Threshold 10)
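The 24-hour schedule above can be encoded as TSM administrative schedules. A sketch in dsmadmc syntax; pool, device-class and node names are placeholders:

```
def sched expire_inv type=administrative active=yes starttime=02:00 -
    period=1 perunits=days cmd="expire inventory"
def sched identify_dup type=administrative active=yes starttime=03:30 -
    period=1 perunits=days cmd="identify duplicates dedup_pool duration=45"
def sched db_backup type=administrative active=yes starttime=04:30 -
    period=1 perunits=days cmd="backup db devclass=dbback_dc type=full"
def sched repl_file type=administrative active=yes starttime=06:00 -
    period=1 perunits=days cmd="replicate node file_nodes"
```

Keeping each window bounded (duration limits on IDENTIFY DUPLICATES, staggered replication start times) is what lets the whole cycle fit inside 24 hours.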
20. 19
What we learned
Achieving increased performance
– Solved by engineering parallelism
• Solved differently for different applications
21. 20
What we learned
Protecting dedup storage pool to tape copypools proved
problematic
Performance depends on file size
* Fabricated data
Node Replication solved this
22. 21
Why Node Replication as High Availability
4 possible solutions
– OS Cluster (Windows Cluster, AIX HACMP)
• Pros
– Robust
– Automated Failover
• Cons
– Only OS fault tolerant
– Traditional Server-to-Server Copy Pool Virtual Volumes
• Pros
– Robust
– Volume failure recovery
• Cons
– Long RTO
– Cumbersome and long recovery (especially Dedup TSM Servers)
– TSM Node Replication
• Pros
– Relatively simple failover
– Warm standby server ready to go
• Cons
– Young technology
– No easy way to recover from damaged volumes
– TSM DB2 HADR
• Pros
– Easy Failover
– Cold standby server ready to go
– Can take over metadata only
• Cons
– No installation of the kind we proposed existed in production.
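In outline, enabling node replication takes only a few commands on the source server (server and node names are placeholders; see the 6.3+ administrator's reference for the full server-to-server setup):

```
set replserver DRSERVER
update node PRODNODE replstate=enabled
replicate node PRODNODE
```

This simplicity, combined with the warm standby it produces, is why we chose it over the three alternatives above.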
23. 22
Our Performance Design Formula
IF (X < 95%) AND (Y < 95%)
THEN N = N + 1
X = CPU load
Y = disk response time (15 ms = 95%)
N = number of parallel worker threads in TSM
• Simplified formula to maximise workload on our TSM servers
– If an idle TSM resource is detected, more threads are added.
• CPU or disk response time should always be the bottleneck
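The formula can be sketched as a tuning rule (the names below are ours, not a TSM interface):

```python
# Thread-scaling heuristic from the slide: while both CPU load and
# disk response time sit below their ceilings, add another parallel
# TSM worker thread; otherwise hold steady, since one of the two
# resources is (correctly) the bottleneck.

CPU_CEILING = 0.95       # 95% CPU load
DISK_CEILING_MS = 15.0   # 15 ms response time == the 95% ceiling on the slide

def next_thread_count(cpu_load: float, disk_response_ms: float, n_threads: int) -> int:
    """Return the worker-thread count to use for the next tuning interval."""
    if cpu_load < CPU_CEILING and disk_response_ms < DISK_CEILING_MS:
        return n_threads + 1   # idle resource detected: add a thread
    return n_threads           # a bottleneck is saturated: hold
```

Applied periodically, this ramps MIGPROCESS-style parallelism up until either the CPUs or the disks become the limiting factor.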
24. 23
Two different sites used to collect information
TSM environment 1
2 TSM Servers Active/Inactive
(NR)
Windows 2008 R2
8 Core
140 Gig RAM
1,4 TB DB
– 44 (SAS RAID-10)
70TB Dedup Storage Pool
– DS3200
• 2&3TB SATA RAID-5 and
RAID-6
Current Bottleneck
– CPUs
TSM environment 2
2 TSM Servers Active/Inactive
(NR)
Windows 2008 R2
12 Core (24 multithreaded)
120 Gig RAM
1,7 TB DB
– 4 Internal SSD (RAID-5)
– Easy Tier V7000
• 4 400Gb SSD (RAID-5)
• 20 SAS (RAID-10)
~90TB Dedup Storage Pool
– DS3700 behind V7000
• 3TB SATA RAID-6
Current Bottleneck
– SSDs
25. 24
Performance Data
• TSM Server 1,1 TB Database, 80 TB Dedup Pool
• 20-thread deduplication space reclamation (threshold 10)
– Sustained 5,000-9,000 IOPS
• 17 TB of total data transfer
– 7 Write
– 10 Read
• CPU Near Fully Utilized
TSM env1
26. 25
Performance numbers
1,3TB TSM Database on Storwize SSD/SAS Easy Tier
– Max 30,000 IOPS!
– Space reclaim moves 4,3TB per hour (read & write)
Average DB IOPS
8,000-12,000
TSM env 2
27. 26
Performance
TSM dedup environment totals over 24 hours
– Database writes ~1x its size every day
• a 1TB database writes 970GB
– Database reads ~1,5x its size every day
• a 1TB database reads 1,5TB
– SSDs in RAID-5 become the bottleneck during write-intensive
operations (space reclamation W/R 50/50)
TSM env 2
28. 27
What has changed
Use Deduplication where we can
– Exclusion/Special treatment:
• Very large single objects
• Encrypted data
• Large non-repetitive data
We can't use a storage pool hierarchy based on small file pools
anymore, due to client dedup restrictions
We assign a specific DISK device class to the VMware control
storage pool
– Reduces the mount point requirement
30. 29
What we have learned so far
Our TSM dedup servers can scale up to 400TB of managed data (pre-
dedup); this is based on DB size
We can't be cheap when it comes to our TSM server hardware
– A lot of RAM, 48GB minimum
– 12 cores (Intel)
• Only put multiple TSM instances on AIX.
– Use really fast disks for your database, it's going to get hammered
(5,000+ IOPS). Preferably SSD or a lot of spindles
– We use maximum active log size off the bat
• We must be careful about our space reclaim workload; many threads
can eat up all the logs really fast.
• Gigantic single objects (1,0TB+) will pin the log; be careful
about workload during that object's backup time.
– Larger databases.
• X2 if you dedup only B/A client data.
• X3 - ∞ If you dedup Application data as well.
– Depends a lot on how long you plan to keep your data in the pool
– Copypool to tape from the dedup pool proved difficult to use, so we
use node replication instead.
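The database-sizing rules of thumb above can be turned into a rough estimator. The per-TB baseline below is our own assumption (a non-dedup TSM 6 rule of thumb, tunable per environment); only the 2x/3x deduplication multipliers come from the slide:

```python
# Rough TSM dedup database sizing sketch based on the slide's rules.
BASE_GB_PER_TB = 3.0  # assumed non-dedup DB baseline per TB of managed data

def estimate_db_gb(managed_tb: float, ba_client_data_only: bool) -> float:
    """Very rough TSM dedup DB size estimate, in GB."""
    # Slide: ~2x DB growth when deduplicating B/A client data only,
    # 3x (or far more) when application data is deduplicated too.
    multiplier = 2.0 if ba_client_data_only else 3.0
    return managed_tb * BASE_GB_PER_TB * multiplier
```

For application data, treat the 3x multiplier as a floor, not a prediction; the slide's upper bound is open-ended and depends on retention in the pool.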
31. 30
What we have learned so far
We Use Client Deduplication if we have the backup window to do
so.
– It will cause performance degradation on your backups; application
backups are more affected (assuming no transport
bottleneck)
– Send all client data directly to the dedup storage pool
– Saves a lot of work on your SATA drives
We use VMware backup when possible; ALL VMs must be at a high
enough HW level to support CBT.
– If not, use FCM only on those machines
We use Client Compression to add more space savings.
– Be careful with 3rd-party compression; it may have
adverse effects on the deduplication ratio.
– We DON'T use compression if we plan on doing server-side
deduplication; bad ratio
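On the client side this maps to a handful of B/A client options; a minimal dsm.opt/dsm.sys excerpt (whether they take effect also depends on the server and pool configuration, so check against your client level):

```
DEDUPLICATION     YES
ENABLEDEDUPCACHE  YES
COMPRESSION       YES
```

Per the slide, leave COMPRESSION at NO for nodes whose pool does server-side deduplication, since compressed data deduplicates poorly.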
32. 31
What we have learned so far
Utilize client parallelism for greater speed and workload
– B/A client - resource utilization
– TDP SQL, - multiple database backup concurrently or
stripes
– VMware - use VMMAXPARALLEL, be careful not to exceed
8 on each host at the same time
– Exchange - multiple data movers
We keep our deduplication volumes small, 12-24GB
Run aggressive space reclamation
– Aim for 10% Threshold (90% or less utilized)
Keep large objects in a separate storage pool (active log size)
Keep VMware CTL files in a separate (DISK Device class) storage
pool.
TSM database backup as fast as possible
– All log activity during the backup window is applied at the end
– Need increased active log space due to this.
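The small-volume and aggressive-reclamation points above can be expressed as server commands; a sketch (device-class, directory and pool names are placeholders):

```
define devclass dedup_fc devtype=file maxcapacity=24G -
    directory=/tsm/pool mountlimit=200
reclaim stgpool dedup_pool threshold=10 duration=300
```

MAXCAPACITY bounds the FILE volume size at the top of our 12-24GB range, and THRESHOLD=10 reclaims any volume with 10% or more reclaimable space, i.e. 90% or less utilized.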
33. 32
What we have learned so far
Plan for a small tape pool for data that does not suit well in
dedup storage pools
– Application databases that change a lot (reorgs, index rebuilds,
etc.)
– Data that requires encryption.
– Very large single objects, 750GB+
For higher-RTO systems we utilize FCM as a method to
achieve instant restore for the newest data; alternatively we utilize
parallel threads to achieve our RTO goal, but there are
drawbacks.
Readjust the copy rate for FCM; the default setting does not
always apply
In rare cases heavily utilized applications can't handle LUN
quiescing
– FCM or block-level backups can't be used.
35. 34
Chicago World's Fair 1893
• Moral of the story
• Expect more technological
innovation than you can
imagine in the coming years
• Don't get your hopes up;
IT innovation won't solve
all its problems.
- With new and improved
technology new challenges
emerge.