Techniques for Managing Huge Data LISA10
1. USENIX LISA10 November 7, 2010
Techniques for Handling Huge
Storage
Richard.Elling@RichardElling.com
USENIX LISA’10 Conference
November 8, 2010
2. USENIX LISA10 November 7, 2010
Agenda
How did we get here?
When good data goes bad
Capacity, planning, and design
What comes next?
2
Note: this tutorial uses live demos, slides not so much
4. USENIX LISA10 November 7, 2010
Milestones in Tape Evolution
4
1951 - magnetic tape for data storage
1964 - 9 track
1972 - Quarter Inch Cartridge (QIC)
1977 - Commodore Datasette
1984 - IBM 3480
1989 - DDS/DAT
1995 - IBM 3590
2000 - T9940
2000 - LTO
2006 - T10000
2008 - TS1130
5. USENIX LISA10 November 7, 2010
Milestones in Disk Evolution
5
1954 - hard disk invented
1950s - Solid state disk invented
1981 - Shugart Associates System Interface (SASI)
1984 - Personal Computer Advanced Technology (PC/AT) Attachment,
later shortened to ATA
1986 - “Small” Computer System Interface (SCSI)
1986 - Integrated Drive Electronics (IDE)
1994 - EIDE
1994 - Fibre Channel (FC)
1995 - Flash-based SSDs
2001 - Serial ATA (SATA)
2005 - Serial Attached SCSI (SAS)
6. USENIX LISA10 November 7, 2010
Architectural Changes
Simple, parallel interfaces
Serial interfaces
Aggregated serial interfaces
6
8. USENIX LISA10 November 7, 2010
Failure Rates
Mean Time Between Failures (MTBF)
A statistical measure of the interarrival rate of failures
Often cited in literature and data sheets
MTBF = total operating hours / total number of failures
Annualized Failure Rate (AFR)
AFR = operating hours per year / MTBF
Expressed as a percent
Example
MTBF = 1,200,000 hours
Year = 24 x 365 = 8,760 hours
AFR = 8,760 / 1,200,000 = 0.0073 = 0.73%
AFR is easier to grok than MTBF
8
Operating hours per year is a flexible definition
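A minimal sketch of the MTBF-to-AFR conversion above (my own illustration, not from the deck), using the 24 x 365 operating-hours assumption called out on this slide:

# Convert MTBF (hours) to Annualized Failure Rate (AFR, percent),
# assuming 24x7 operation (8,760 operating hours per year).
def afr_percent(mtbf_hours, operating_hours_per_year=24 * 365):
    return 100.0 * operating_hours_per_year / mtbf_hours

print(afr_percent(1_200_000))  # ~0.73, matching the example above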
9. USENIX LISA10 November 7, 2010
Multiple Systems and Statistics
Consider 100 systems each with an MTBF = 1,000 hours
At time=1,000 hours, 100 failures occurred
Not all systems will see one failure
9
[Bar chart: Number of Systems (0-40) vs. Number of Failures (0-4), with annotations marking the "Unlucky," "Very Unlucky," and "Very, Very Unlucky" systems that see multiple failures]
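A hedged simulation of this point (my own sketch, not from the deck): 100 systems with MTBF = 1,000 hours produce roughly 100 failures by t = 1,000 hours, but the failures are spread unevenly across systems.

import random

# Simulate per-system failure counts over one MTBF interval, assuming a
# constant failure rate (exponential interarrival times).
systems, mtbf, horizon = 100, 1000.0, 1000.0
counts = []
for _ in range(systems):
    t, n = 0.0, 0
    while True:
        t += random.expovariate(1.0 / mtbf)  # time to next failure
        if t > horizon:
            break
        n += 1
    counts.append(n)

print("total failures:", sum(counts))        # ~100 on average
for k in range(max(counts) + 1):
    print(k, "failures:", counts.count(k), "systems")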
10. USENIX LISA10 November 7, 2010
Failure Rates
MTBF is a summary metric
Manufacturers estimate MTBF by stressing many units for
short periods of qualification time
Summary metrics hide useful information
Example: mortality study
Study mortality of children aged 5-14 during 1996-1998
Measured 20.8 per 100,000
MTBF = 100,000 / 20.8 ≈ 4,807 years
Current world average life expectancy is 67.2 years
For large populations, such as huge disk farms, the summary
MTBF can appear constant
The better question to answer is, “is my failure rate increasing or
decreasing?”
10
11. USENIX LISA10 November 7, 2010
Why Do We Care?
Summary statistics, like MTBF or AFR, can be misleading or risky if
we do not also distinguish between stable and trending processes
We need to analyze the ordered times between failures in relation
to system age to describe system reliability
11
12. USENIX LISA10 November 7, 2010
Time Dependent Reliability
Useful for repairable systems
A system that can be restored to satisfactory operation by some repair action
Failures occur sequentially in time
Measure the age of the components of a system
Need to distinguish age from interarrival times (time between
failures)
Doesn’t have to be precise, resolution of weeks works ok
Some devices report Power On Hours (POH)
SMART for disks
OSes
Clerical solutions or inventory asset systems work fine
12
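A minimal sketch of the TDR bookkeeping described above (my own illustration; the example data is hypothetical): given the ages at which each system failed, compute the mean cumulative number of failures versus system age, the quantity plotted on the next slides.

# For one disk set: per-system list of ages (months) at which failures occurred.
ages_at_failure = [
    [7, 19],   # system 1 failed at 7 and 19 months of age
    [11],      # system 2 failed once
    [],        # system 3 has not failed yet
]

# Simplified mean cumulative function: assumes every system has been
# observed for the full max_age (no censoring correction).
def mean_cumulative_failures(ages_at_failure, max_age):
    n_systems = len(ages_at_failure)
    return [
        sum(1 for ages in ages_at_failure for a in ages if a <= month) / n_systems
        for month in range(1, max_age + 1)
    ]

mcf = mean_cumulative_failures(ages_at_failure, max_age=24)
print(mcf[-1], "mean cumulative failures per system at 24 months")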
13. USENIX LISA10 November 7, 2010
TDR Example 1
13
[Plot: Mean Cumulative Failures (0-20) vs. System Age in months (1-50) for Disk Set A, Disk Set B, and Disk Set C, with a Target MTBF reference line]
14. USENIX LISA10 November 7, 2010
TDR Example 2
14
Did a common event occur?
[Plot: Mean Cumulative Failures (0-20) vs. System Age in months (1-50) for Disk Set A, Disk Set B, and Disk Set C, with a Target MTBF reference line]
15. USENIX LISA10 November 7, 2010
TDR Example 2.5
15
[Plot: Mean Cumulative Failures (0-20) vs. calendar date (Jan 1, 2010 - Feb 3, 2014)]
16. USENIX LISA10 November 7, 2010
Long Term Storage
Near-line disk systems for backup
Access time and bandwidth advantages over tape
Enterprise-class tape for backup and archival
15-30 years shelf life
Significant ECC
Read error rate: 1e-20
Enterprise-class HDD read error rate: 1e-15
16
17. USENIX LISA10 November 7, 2010
Reliability
17
Reliability is time dependent
TDR analysis reveals trends
Use cumulative plots, mean cumulative plots, and recurrence rates
Graphs are good
Track failures and downtime by system versus age and calendar dates
Correlate anomalous behavior
Manage retirement, refresh, preventative processes using real data
19. USENIX LISA10 November 7, 2010
Reading Data Sheets
Manufacturers publish useful data sheets and product guides
Reliability information
MTBF or AFR
UER, or equivalent
Warranty
Performance
Interface bandwidth
Sustained bandwidth (aka internal or media bandwidth)
Average rotational delay or rpm (HDD)
Average response or seek time
Native sector size
Environmentals
Power
19
AFR operating hours per year can be a footnote
21. USENIX LISA10 November 7, 2010
Nines Matter
Is the Internet up?
21
22. USENIX LISA10 November 7, 2010
Nines Matter
Is the Internet up?
Is the Internet down?
22
23. USENIX LISA10 November 7, 2010
Nines Matter
Is the Internet up?
Is the Internet down?
Is the Internet’s reliability 5-9’s?
23
24. USENIX LISA10 November 7, 2010
Nines Don’t Matter
Is the Internet up?
Is the Internet down?
Is the Internet’s reliability 5-9’s?
Do 5-9’s matter?
24
25. USENIX LISA10 November 7, 2010
Reliability Matters!
Is the Internet up?
Is the Internet down?
Is the Internet’s reliability 5-9’s?
Do 5-9’s matter?
Reliability matters!
25
26. USENIX LISA10 November 7, 2010
Designing for Failure
Change design perspective
Design to success
How to make it work?
What you learned in school: solve the equation
Can be difficult...
Design for failure
How to make it work when everything breaks?
What you learned in the army: win the war
Can be difficult... at first...
26
27. USENIX LISA10 November 7, 2010
Example: Design for Success
[Diagram: two x86 servers running NexentaStor with the HA-Cluster plugin, attached to shared storage over FC, SAS, or iSCSI]
28. USENIX LISA10 November 7, 2010
Designing for Failure
Application-level replication
Hard to implement - coding required
Some activity in open community
Hard to apply to general purpose computing
Examples
DoD, Google, Facebook, Amazon, ...
The big guys
Tends to scale well with size
Multiple copies of data
28
29. USENIX LISA10 November 7, 2010
Reliability - Availability
Reliability trumps availability
If disks didn’t break, RAID would not exist
If servers didn’t break, HA cluster would not exist
Reliability measured in probabilities
Availability measured in nines
29
31. USENIX LISA10 November 7, 2010
Evaluating Data Retention
MTTDL = Mean Time To Data Loss
Note: MTBF is not constant in the real world, but keeps math simple
MTTDL[1] is a simple MTTDL model
No parity (single vdev, striping, RAID-0)
MTTDL[1] = MTBF / N
Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
Double Parity (3-way mirror, RAIDZ2, RAID-6)
MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
Triple Parity (4-way mirror, RAIDZ3)
MTTDL[1] = MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3)
31
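A hedged calculator for the MTTDL[1] formulas above (my own sketch; the example inputs are assumptions, not vendor data):

# MTTDL[1]: simple model, constant MTBF, ignores unrecoverable reads.
def mttdl1_hours(mtbf_hours, n_disks, mttr_hours, parity):
    if parity == 0:                    # single vdev, striping, RAID-0
        return mtbf_hours / n_disks
    num = mtbf_hours ** (parity + 1)
    den = 1.0
    for i in range(parity + 1):        # N * (N-1) * ... * (N-parity)
        den *= n_disks - i
    return num / (den * mttr_hours ** parity)

hours_per_year = 24 * 365
# Assumed example: 8-disk raidz2 (double parity), MTBF 1.2M hours, 24-hour MTTR
print(mttdl1_hours(1_200_000, 8, 24, parity=2) / hours_per_year, "years")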
32. USENIX LISA10 November 7, 2010
Another MTTDL Model
The MTTDL[1] model doesn't take unrecoverable reads into account
But unrecoverable reads (UER) are becoming the dominant failure
mode
UER is specified as errors per bits read (e.g., one error per 10^15 bits)
More bits = higher probability of loss per vdev
MTTDL[2] model considers UER
32
33. USENIX LISA10 November 7, 2010
Why Worry about UER?
Richard's study
3,684 hosts with 12,204 LUNs
11.5% of all LUNs reported read errors
Bairavasundaram et al., FAST ’08
www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
1.53M LUNs over 41 months
RAID reconstruction discovers 8% of checksum mismatches
“For some drive models as many as 4% of drives develop
checksum mismatches during the 17 months examined”
Manufacturers trade UER for space
33
34. USENIX LISA10 November 7, 2010
Why Worry about UER?
RAID array study
34
35. USENIX LISA10 November 7, 2010
Why Worry about UER?
RAID array study
35
[Chart annotations: unrecoverable reads vs. disks that disappeared (“disk pull”)]
“Disk pull” tests aren’t very useful
36. USENIX LISA10 November 7, 2010
MTTDL[2] Model
Probability that a reconstruction will fail
Precon_fail = (N-1) * size / UER
Model doesn't work for non-parity schemes
single vdev, striping, RAID-0
Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
MTTDL[2] = MTBF / (N * Precon_fail)
Double Parity (3-way mirror, RAIDZ2, RAID-6)
MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
Triple Parity (4-way mirror, RAIDZ3)
MTTDL[2] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2 * Precon_fail)
36
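A sketch of the MTTDL[2] formulas above (again my own code, not the original model's implementation); it assumes UER is expressed the way the slide uses it, i.e., bits read per unrecoverable error (1e15 for a 10^-15 rate):

# MTTDL[2]: adds the probability that a reconstruction hits an unrecoverable read.
# parity must be >= 1; the model doesn't apply to non-parity schemes.
def mttdl2_hours(mtbf_hours, n_disks, mttr_hours, parity, disk_bits, uer_bits_per_error):
    precon_fail = (n_disks - 1) * disk_bits / uer_bits_per_error  # per the slide
    den = float(precon_fail)
    for i in range(parity):            # N * (N-1) * ... terms
        den *= n_disks - i
    den *= mttr_hours ** (parity - 1)
    return mtbf_hours ** parity / den

# Assumed example: 8 x 2 TB disks, single parity, 1 error per 1e15 bits read
disk_bits = 2e12 * 8
print(mttdl2_hours(1_200_000, 8, 24, 1, disk_bits, 1e15) / (24 * 365), "years")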
41. USENIX LISA10 November 7, 2010
Space, Dependability, and
Performance
41
42. USENIX LISA10 November 7, 2010
Dependability Use Case
Customer has 15+ TB of read-mostly data
16-slot, 3.5” drive chassis
2 TB HDDs
Option 1: one raidz2 set
24 TB available space
12 data
2 parity
2 hot spares, 48 hour disk replacement time
MTTDL[1] = 1,790,000 years
Option 2: two raidz2 sets
24 TB available space (12 TB per set)
6 data
2 parity
no hot spares
MTTDL[1] = 7,450,000 years
42
43. USENIX LISA10 November 7, 2010
Planning for Spares
More systems means a greater need for spares
How many spares do you need?
How often do you plan replacements?
Replacing devices immediately becomes impractical
Not replacing devices increases risk, but how much?
There is no black/white answer, it depends...
43
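One hedged way to put a number on the spares question (my own sketch; the drive count, AFR, and restocking interval are assumed inputs): size the spares pool to the expected failures per restocking interval, plus headroom.

import math

# Expected drive failures during one restocking interval.
def expected_failures(n_drives, afr_percent, restock_days):
    return n_drives * (afr_percent / 100.0) * (restock_days / 365.0)

# Assumed example: 500 drives, 0.73% AFR, spares restocked quarterly
mean = expected_failures(500, 0.73, 90)
spares = math.ceil(mean) + 2   # crude headroom; a Poisson quantile would be better
print(round(mean, 2), "expected failures per quarter;", spares, "spares on hand")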
46. USENIX LISA10 November 7, 2010
46
Space
Space is a poor sizing metric, really!
Technology marketing heavily pushes space
Maximizing space can mean compromising performance AND
reliability
As disks and tapes get bigger, they don’t get better
$150 rule
PHBs get all excited about space
Most current capacity planning tools manage by space
47. USENIX LISA10 November 7, 2010
Bandwidth
Bandwidth constraints in modern systems are rare
Overprovisioning for bandwidth is relatively simple
Where to gain bandwidth can be tricky
Link aggregation
Ethernet
SAS
MPIO
Adding parallelism beyond 2 trades off reliability
47
48. USENIX LISA10 November 7, 2010
Latency
Lower latency == better performance
Latency != IOPS
IOPS also achieved with parallelism
Parallelism only improves latency when latency is constrained by
bandwidth
Latency = access time + transfer time
HDD
Access time limited by seek and rotate
Transfer time usually limited by media or internal bandwidth
SSD
Access time limited by architecture more than c
Transfer time limited by architecture and interface
Tape
Access time measured in seconds
48
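A back-of-the-envelope version of “latency = access time + transfer time” for an HDD (my own illustration; the drive parameters are assumptions, not data-sheet values):

# Rough single-I/O HDD latency: seek + average (half) rotation + media transfer.
def hdd_latency_ms(seek_ms, rpm, io_bytes, media_mb_per_s):
    rotational_ms = 0.5 * 60_000.0 / rpm                  # half a rotation, in ms
    transfer_ms = io_bytes / (media_mb_per_s * 1e6) * 1e3
    return seek_ms + rotational_ms + transfer_ms

# Assumed 7200 rpm drive, 8 ms average seek, 120 MB/s media rate, 8 KiB I/O
print(round(hdd_latency_ms(8.0, 7200, 8192, 120.0), 2), "ms")  # ~12 ms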
50. USENIX LISA10 November 7, 2010
What is Deduplication?
A $2.1 Billion feature
2009 buzzword of the year
Technique for improving storage space efficiency
Trades big I/Os for small I/Os
Does not eliminate I/O
Implementation styles
offline or post processing
data written to nonvolatile storage
process comes along later and dedupes data
example: tape archive dedup
inline
data is deduped as it is being allocated to nonvolatile storage
example: ZFS
50
51. USENIX LISA10 November 7, 2010
Dedup how-to
Given a bunch of data
Find data that is duplicated
Build a lookup table of references to data
Replace duplicate data with a pointer to the entry in the lookup table
Granularity
file
block
byte
51
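A minimal block-level sketch of the steps above (my own illustration, not any product's code): hash each block, keep a lookup table keyed by the hash, and store a reference instead of a duplicate copy.

import hashlib

# table: block checksum -> stored block; refs: the deduplicated "file" as a
# list of checksums pointing into the table.
def dedup(blocks):
    table, refs = {}, []
    for block in blocks:
        key = hashlib.sha256(block).hexdigest()  # block-level granularity
        if key not in table:
            table[key] = block                   # first copy is stored
        refs.append(key)                         # duplicates become references
    return table, refs

table, refs = dedup([b"aaaa", b"bbbb", b"aaaa", b"aaaa"])
print(len(table), "unique blocks for", len(refs), "logical blocks")  # 2 for 4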
52. USENIX LISA10 November 7, 2010
Dedup Constraints
Size of the deduplication table
Quality of the checksums
Collisions happen
All possible permutations of N bits cannot be stored in N/10 bits
Checksums can be evaluated by probability of collisions
Multiple checksums can be used, but gains are marginal
Compression algorithms can work against deduplication
Dedup before or after compression?
52
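A hedged look at “collisions happen” (my own sketch): the birthday bound approximates the probability that any two of n distinct blocks collide under a b-bit checksum.

import math

# Birthday-bound approximation: p ~= 1 - exp(-n^2 / 2^(b+1))
def collision_probability(n_blocks, checksum_bits):
    return 1.0 - math.exp(-(n_blocks ** 2) / float(2 ** (checksum_bits + 1)))

# Assumed example: 2^32 blocks under a 256-bit vs. a 64-bit checksum
print(collision_probability(2 ** 32, 256))  # effectively zero
print(collision_probability(2 ** 32, 64))   # ~0.39, far from negligible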
53. USENIX LISA10 November 7, 2010
Verification
[Flowchart: write() -> compress -> checksum -> DDT entry lookup; if no DDT match, allocate a new entry; on a match, optionally verify by reading the stored data and comparing, then add a reference on a true match or allocate a new entry otherwise]
53
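A sketch of the verify step in that flow (my own simplified Python, not the ZFS implementation): on a checksum match, optionally read back the stored block and compare before adding a reference.

import hashlib

# ddt: checksum -> {"addr": block address, "refs": count}; store: list of blocks.
def dedup_write(block, ddt, store, verify=False):
    key = hashlib.sha256(block).hexdigest()      # checksum of the (compressed) block
    entry = ddt.get(key)
    if entry is not None:
        if verify and store[entry["addr"]] != block:
            store.append(block)                  # rare collision: write, don't dedup
            return len(store) - 1
        entry["refs"] += 1                       # match: just add a reference
        return entry["addr"]
    store.append(block)                          # no DDT match: new entry
    ddt[key] = {"addr": len(store) - 1, "refs": 1}
    return len(store) - 1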
54. USENIX LISA10 November 7, 2010
Reference Counts
54
Eggs courtesy of Richard’s chickens
56. USENIX LISA10 November 7, 2010
Replication Services
[Chart: replication services plotted by Recovery Point Objective (days, hours, seconds) against system I/O performance impact (slower to faster): traditional backup (NDMP, tar), file-level sync (rsync), object-level sync (databases, ZFS), block replication (DRBD, SNDR), application-level replication, and mirroring]
56
57. USENIX LISA10 November 7, 2010
How Many Copies Do You Need?
Answer: at least one, more is better...
One production, one backup
One production, one near-line, one backup
One production, one near-line, one backup, one at DR site
One production, one near-line, one backup, one at DR site, one
archived in a vault
RAID doesn’t count
Consider 3 to 4 as a minimum for important data
57
58. USENIX LISA10 November 7, 2010
Tiering Example
58
[Diagram: a big, honking disk array backed up to a big, honking tape library via file-based backup]
Works great, but...
59. USENIX LISA10 November 7, 2010
Tiering Example
59
[Diagram: the same disk array, tape library, and file-based backup; 10 million files, 1 million daily changes, 12-hour backup window]
... backups never complete
60. USENIX LISA10 November 7, 2010
Tiering Example
60
[Diagram: the disk array replicates hourly at the block level to a near-line backup tier, which feeds the big, honking tape library on a weekly backup window; 10 million files, 1 million daily changes]
Backups to near-line storage and tape have different policies
61. USENIX LISA10 November 7, 2010
Tiering Example
61
[Diagram: the same tiers - disk array, near-line backup, tape library]
Quick file restoration possible
62. USENIX LISA10 November 7, 2010
Application-Level Replication
Example
62
[Diagram: an application stores its data at three different sites (Site 1, Site 2, Site 3); one site serves as a long-term archive option]
64. USENIX LISA10 November 7, 2010
Reading Data Sheets Redux
Manufacturers publish useful data sheets and product guides
Reliability information
MTBF or AFR
UER, or equivalent
Warranty
Performance
Interface bandwidth
Sustained bandwidth (aka internal or media bandwidth)
Average rotational delay or rpm (HDD)
Average response or seek time
Native sector size
Environmentals
Power
64
AFR operating hours per year can be a footnote
66. USENIX LISA10 November 7, 2010
Key Points
66
You will need many copies of your data, get used to it
The cost/byte decreases faster than kicking old habits
Replication is a good thing, use often
Tiering is a good thing, use often
Beware of designing for success, design for failure, too
Reliability trumps availability
Space, dependability, performance: pick two