Techniques for Managing Huge Data LISA10
1. USENIX LISA10 November 7, 2010
Techniques for Handling Huge
Storage
Richard.Elling@RichardElling.com
USENIX LISA’10 Conference
November 8, 2010
2. USENIX LISA10 November 7, 2010
Agenda
How did we get here?
When good data goes bad
Capacity, planning, and design
What comes next?
2
Note: this tutorial uses live demos, slides not so much
4. USENIX LISA10 November 7, 2010
Milestones in Tape Evolution
4
1951 - magnetic tape for data storage
1964 - 9 track
1972 - Quarter Inch Cartridge (QIC)
1977 - Commodore Datasette
1984 - IBM 3480
1989 - DDS/DAT
1995 - IBM 3590
2000 - T9940
2000 - LTO
2006 - T10000
2008 - TS1130
5. USENIX LISA10 November 7, 2010
Milestones in Disk Evolution
5
1954 - hard disk invented
1950s - Solid state disk invented
1981 - Shugart Associates System Interface (SASI)
1984 - Personal Computer Advanced Technology (PC/AT) Attachment,
later shortened to ATA
1986 - “Small” Computer System Interface (SCSI)
1986 - Integrated Drive Electronics (IDE)
1994 - EIDE
1994 - Fibre Channel (FC)
1995 - Flash-based SSDs
2001 - Serial ATA (SATA)
2005 - Serial Attached SCSI (SAS)
6. USENIX LISA10 November 7, 2010
Architectural Changes
Simple, parallel interfaces
Serial interfaces
Aggregated serial interfaces
6
8. USENIX LISA10 November 7, 2010
Failure Rates
Mean Time Between Failures (MTBF)
A statistical measure of the interarrival rate of failures
Often cited in literature and data sheets
MTBF = total operating hours / total number of failures
Annualized Failure Rate (AFR)
AFR = operating hours per year / MTBF
Expressed as a percent
Example
MTBF = 1,200,000 hours
Year = 24 x 365 = 8,760 hours
AFR = 8,760 / 1,200,000 = 0.0073 = 0.73%
AFR is easier to grok than MTBF
8
Operating hours per year is a flexible definition
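A minimal sketch of the MTBF-to-AFR conversion above (my own illustration, not from the deck), using the 24 x 365 operating-hours assumption called out on this slide:

# Convert MTBF (hours) to Annualized Failure Rate (AFR, percent),
# assuming 24x7 operation (8,760 operating hours per year).
def afr_percent(mtbf_hours, operating_hours_per_year=24 * 365):
    return 100.0 * operating_hours_per_year / mtbf_hours

print(afr_percent(1_200_000))  # ~0.73, matching the example above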
9. USENIX LISA10 November 7, 2010
Multiple Systems and Statistics
Consider 100 systems each with an MTBF = 1,000 hours
At time=1,000 hours, 100 failures occurred
Not all systems will see one failure
9
[Bar chart: Number of Systems (0-40) vs. Number of Failures (0-4), with annotations marking the "Unlucky," "Very Unlucky," and "Very, Very Unlucky" systems that see multiple failures]
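A hedged simulation of this point (my own sketch, not from the deck): 100 systems with MTBF = 1,000 hours produce roughly 100 failures by t = 1,000 hours, but the failures are spread unevenly across systems.

import random

# Simulate per-system failure counts over one MTBF interval, assuming a
# constant failure rate (exponential interarrival times).
systems, mtbf, horizon = 100, 1000.0, 1000.0
counts = []
for _ in range(systems):
    t, n = 0.0, 0
    while True:
        t += random.expovariate(1.0 / mtbf)  # time to next failure
        if t > horizon:
            break
        n += 1
    counts.append(n)

print("total failures:", sum(counts))        # ~100 on average
for k in range(max(counts) + 1):
    print(k, "failures:", counts.count(k), "systems")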
10. USENIX LISA10 November 7, 2010
Failure Rates
MTBF is a summary metric
Manufacturers estimate MTBF by stressing many units for
short periods of qualification time
Summary metrics hide useful information
Example: mortality study
Study mortality of children aged 5-14 during 1996-1998
Measured 20.8 per 100,000
MTBF = 100,000 / 20.8 ≈ 4,807 years
Current world average life expectancy is 67.2 years
For large populations, such as huge disk farms, the summary
MTBF can appear constant
The better question to answer is, “is my failure rate increasing or
decreasing?”
10
11. USENIX LISA10 November 7, 2010
Why Do We Care?
Summary statistics, like MTBF or AFR, can be misleading or risky if
we do not also distinguish between stable and trending processes
We need to analyze the ordered times between failures in relation
to system age to describe system reliability
11
12. USENIX LISA10 November 7, 2010
Time Dependent Reliability
Useful for repairable systems
A system that can be restored to satisfactory operation by some repair action
Failures occur sequentially in time
Measure the age of the components of a system
Need to distinguish age from interarrival times (time between
failures)
Doesn’t have to be precise, resolution of weeks works ok
Some devices report Power On Hours (POH)
SMART for disks
OSes
Clerical solutions or inventory asset systems work fine
12
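A minimal sketch of the TDR bookkeeping described above (my own illustration; the example data is hypothetical): given the ages at which each system failed, compute the mean cumulative number of failures versus system age, the quantity plotted on the next slides.

# For one disk set: per-system list of ages (months) at which failures occurred.
ages_at_failure = [
    [7, 19],   # system 1 failed at 7 and 19 months of age
    [11],      # system 2 failed once
    [],        # system 3 has not failed yet
]

# Simplified mean cumulative function: assumes every system has been
# observed for the full max_age (no censoring correction).
def mean_cumulative_failures(ages_at_failure, max_age):
    n_systems = len(ages_at_failure)
    return [
        sum(1 for ages in ages_at_failure for a in ages if a <= month) / n_systems
        for month in range(1, max_age + 1)
    ]

mcf = mean_cumulative_failures(ages_at_failure, max_age=24)
print(mcf[-1], "mean cumulative failures per system at 24 months")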
13. USENIX LISA10 November 7, 2010
TDR Example 1
13
[Plot: Mean Cumulative Failures (0-20) vs. System Age in months (1-50) for Disk Set A, Disk Set B, and Disk Set C, with a Target MTBF reference line]
14. USENIX LISA10 November 7, 2010
TDR Example 2
14
Did a common event occur?
[Plot: Mean Cumulative Failures (0-20) vs. System Age in months (1-50) for Disk Set A, Disk Set B, and Disk Set C, with a Target MTBF reference line]
15. USENIX LISA10 November 7, 2010
TDR Example 2.5
15
[Plot: Mean Cumulative Failures (0-20) vs. calendar date (Jan 1, 2010 - Feb 3, 2014)]
16. USENIX LISA10 November 7, 2010
Long Term Storage
Near-line disk systems for backup
Access time and bandwidth advantages over tape
Enterprise-class tape for backup and archival
15-30 years shelf life
Significant ECC
Read error rate: 1e-20
Enterprise-class HDD read error rate: 1e-15
16
17. USENIX LISA10 November 7, 2010
Reliability
17
Reliability is time dependent
TDR analysis reveals trends
Use cumulative plots, mean cumulative plots, and recurrence rates
Graphs are good
Track failures and downtime by system versus age and calendar dates
Correlate anomalous behavior
Manage retirement, refresh, preventative processes using real data
19. USENIX LISA10 November 7, 2010
Reading Data Sheets
Manufacturers publish useful data sheets and product guides
Reliability information
MTBF or AFR
UER, or equivalent
Warranty
Performance
Interface bandwidth
Sustained bandwidth (aka internal or media bandwidth)
Average rotational delay or rpm (HDD)
Average response or seek time
Native sector size
Environmentals
Power
19
AFR operating hours per year can be a footnote
21. USENIX LISA10 November 7, 2010
Nines Matter
Is the Internet up?
21
22. USENIX LISA10 November 7, 2010
Nines Matter
Is the Internet up?
Is the Internet down?
22
23. USENIX LISA10 November 7, 2010
Nines Matter
Is the Internet up?
Is the Internet down?
Is the Internet’s reliability 5-9’s?
23
24. USENIX LISA10 November 7, 2010
Nines Don’t Matter
Is the Internet up?
Is the Internet down?
Is the Internet’s reliability 5-9’s?
Do 5-9’s matter?
24
25. USENIX LISA10 November 7, 2010
Reliability Matters!
Is the Internet up?
Is the Internet down?
Is the Internet’s reliability 5-9’s?
Do 5-9’s matter?
Reliability matters!
25
26. USENIX LISA10 November 7, 2010
Designing for Failure
Change design perspective
Design to success
How to make it work?
What you learned in school: solve the equation
Can be difficult...
Design for failure
How to make it work when everything breaks?
What you learned in the army: win the war
Can be difficult... at first...
26
27. USENIX LISA10 November 7, 2010
Example: Design for Success
[Diagram: two x86 servers running NexentaStor with the HA-Cluster plugin, attached to shared storage over FC, SAS, or iSCSI]
28. USENIX LISA10 November 7, 2010
Designing for Failure
Application-level replication
Hard to implement - coding required
Some activity in open community
Hard to apply to general purpose computing
Examples
DoD, Google, Facebook, Amazon, ...
The big guys
Tends to scale well with size
Multiple copies of data
28
29. USENIX LISA10 November 7, 2010
Reliability - Availability
Reliability trumps availability
If disks didn’t break, RAID would not exist
If servers didn’t break, HA cluster would not exist
Reliability measured in probabilities
Availability measured in nines
29
31. USENIX LISA10 November 7, 2010
Evaluating Data Retention
MTTDL = Mean Time To Data Loss
Note: MTBF is not constant in the real world, but keeps math simple
MTTDL[1] is a simple MTTDL model
No parity (single vdev, striping, RAID-0)
MTTDL[1] = MTBF / N
Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
Double Parity (3-way mirror, RAIDZ2, RAID-6)
MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
Triple Parity (4-way mirror, RAIDZ3)
MTTDL[1] = MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3)
31
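A hedged calculator for the MTTDL[1] formulas above (my own sketch; the example inputs are assumptions, not vendor data):

# MTTDL[1]: simple model, constant MTBF, ignores unrecoverable reads.
def mttdl1_hours(mtbf_hours, n_disks, mttr_hours, parity):
    if parity == 0:                    # single vdev, striping, RAID-0
        return mtbf_hours / n_disks
    num = mtbf_hours ** (parity + 1)
    den = 1.0
    for i in range(parity + 1):        # N * (N-1) * ... * (N-parity)
        den *= n_disks - i
    return num / (den * mttr_hours ** parity)

hours_per_year = 24 * 365
# Assumed example: 8-disk raidz2 (double parity), MTBF 1.2M hours, 24-hour MTTR
print(mttdl1_hours(1_200_000, 8, 24, parity=2) / hours_per_year, "years")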
32. USENIX LISA10 November 7, 2010
Another MTTDL Model
The MTTDL[1] model doesn't take unrecoverable reads into account
But unrecoverable reads (UER) are becoming the dominant failure
mode
UER is specified as errors per bits read (e.g., one error per 10^15 bits)
More bits = higher probability of loss per vdev
MTTDL[2] model considers UER
32
33. USENIX LISA10 November 7, 2010
Why Worry about UER?
Richard's study
3,684 hosts with 12,204 LUNs
11.5% of all LUNs reported read errors
Bairavasundaram et al., FAST ’08
www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
1.53M LUNs over 41 months
RAID reconstruction discovers 8% of checksum mismatches
“For some drive models as many as 4% of drives develop
checksum mismatches during the 17 months examined”
Manufacturers trade UER for space
33
34. USENIX LISA10 November 7, 2010
Why Worry about UER?
RAID array study
34
35. USENIX LISA10 November 7, 2010
Why Worry about UER?
RAID array study
35
[Chart annotations: unrecoverable reads vs. disks that disappeared (“disk pull”)]
“Disk pull” tests aren’t very useful
36. USENIX LISA10 November 7, 2010
MTTDL[2] Model
Probability that a reconstruction will fail
Precon_fail = (N-1) * size / UER
Model doesn't work for non-parity schemes
single vdev, striping, RAID-0
Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
MTTDL[2] = MTBF / (N * Precon_fail)
Double Parity (3-way mirror, RAIDZ2, RAID-6)
MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
Triple Parity (4-way mirror, RAIDZ3)
MTTDL[2] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2 * Precon_fail)
36
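A sketch of the MTTDL[2] formulas above (again my own code, not the original model's implementation); it assumes UER is expressed the way the slide uses it, i.e., bits read per unrecoverable error (1e15 for a 10^-15 rate):

# MTTDL[2]: adds the probability that a reconstruction hits an unrecoverable read.
# parity must be >= 1; the model doesn't apply to non-parity schemes.
def mttdl2_hours(mtbf_hours, n_disks, mttr_hours, parity, disk_bits, uer_bits_per_error):
    precon_fail = (n_disks - 1) * disk_bits / uer_bits_per_error  # per the slide
    den = float(precon_fail)
    for i in range(parity):            # N * (N-1) * ... terms
        den *= n_disks - i
    den *= mttr_hours ** (parity - 1)
    return mtbf_hours ** parity / den

# Assumed example: 8 x 2 TB disks, single parity, 1 error per 1e15 bits read
disk_bits = 2e12 * 8
print(mttdl2_hours(1_200_000, 8, 24, 1, disk_bits, 1e15) / (24 * 365), "years")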
41. USENIX LISA10 November 7, 2010
Space, Dependability, and
Performance
41
42. USENIX LISA10 November 7, 2010
Dependability Use Case
Customer has 15+ TB of read-mostly data
16-slot, 3.5” drive chassis
2 TB HDDs
Option 1: one raidz2 set
24 TB available space
12 data
2 parity
2 hot spares, 48 hour disk replacement time
MTTDL[1] = 1,790,000 years
Option 2: two raidz2 sets
24 TB available space (12 TB per set)
6 data
2 parity
no hot spares
MTTDL[1] = 7,450,000 years
42
43. USENIX LISA10 November 7, 2010
Planning for Spares
More systems means a greater need for spares
How many spares do you need?
How often do you plan replacements?
Replacing devices immediately becomes impractical
Not replacing devices increases risk, but how much?
There is no black/white answer, it depends...
43
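One hedged way to put a number on the spares question (my own sketch; the drive count, AFR, and restocking interval are assumed inputs): size the spares pool to the expected failures per restocking interval, plus headroom.

import math

# Expected drive failures during one restocking interval.
def expected_failures(n_drives, afr_percent, restock_days):
    return n_drives * (afr_percent / 100.0) * (restock_days / 365.0)

# Assumed example: 500 drives, 0.73% AFR, spares restocked quarterly
mean = expected_failures(500, 0.73, 90)
spares = math.ceil(mean) + 2   # crude headroom; a Poisson quantile would be better
print(round(mean, 2), "expected failures per quarter;", spares, "spares on hand")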
46. USENIX LISA10 November 7, 2010
46
Space
Space is a poor sizing metric, really!
Technology marketing heavily pushes space
Maximizing space can mean compromising performance AND
reliability
As disks and tapes get bigger, they don’t get better
$150 rule
PHBs get all excited about space
Most current capacity planning tools manage by space
47. USENIX LISA10 November 7, 2010
Bandwidth
Bandwidth constraints in modern systems are rare
Overprovisioning for bandwidth is relatively simple
Where to gain bandwidth can be tricky
Link aggregation
Ethernet
SAS
MPIO
Adding parallelism beyond 2 trades off reliability
47
48. USENIX LISA10 November 7, 2010
Latency
Lower latency == better performance
Latency != IOPS
IOPS also achieved with parallelism
Parallelism only improves latency when latency is constrained by
bandwidth
Latency = access time + transfer time
HDD
Access time limited by seek and rotate
Transfer time usually limited by media or internal bandwidth
SSD
Access time limited by architecture more than c
Transfer time limited by architecture and interface
Tape
Access time measured in seconds
48
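A back-of-the-envelope version of “latency = access time + transfer time” for an HDD (my own illustration; the drive parameters are assumptions, not data-sheet values):

# Rough single-I/O HDD latency: seek + average (half) rotation + media transfer.
def hdd_latency_ms(seek_ms, rpm, io_bytes, media_mb_per_s):
    rotational_ms = 0.5 * 60_000.0 / rpm                  # half a rotation, in ms
    transfer_ms = io_bytes / (media_mb_per_s * 1e6) * 1e3
    return seek_ms + rotational_ms + transfer_ms

# Assumed 7200 rpm drive, 8 ms average seek, 120 MB/s media rate, 8 KiB I/O
print(round(hdd_latency_ms(8.0, 7200, 8192, 120.0), 2), "ms")  # ~12 ms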
50. USENIX LISA10 November 7, 2010
What is Deduplication?
A $2.1 Billion feature
2009 buzzword of the year
Technique for improving storage space efficiency
Trades big I/Os for small I/Os
Does not eliminate I/O
Implementation styles
offline or post processing
data written to nonvolatile storage
process comes along later and dedupes data
example: tape archive dedup
inline
data is deduped as it is being allocated to nonvolatile storage
example: ZFS
50
51. USENIX LISA10 November 7, 2010
Dedup how-to
Given a bunch of data
Find data that is duplicated
Build a lookup table of references to data
Replace duplicate data with a pointer to the entry in the lookup table
Granularity
file
block
byte
51
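A minimal block-level sketch of the steps above (my own illustration, not any product's code): hash each block, keep a lookup table keyed by the hash, and store a reference instead of a duplicate copy.

import hashlib

# table: block checksum -> stored block; refs: the deduplicated "file" as a
# list of checksums pointing into the table.
def dedup(blocks):
    table, refs = {}, []
    for block in blocks:
        key = hashlib.sha256(block).hexdigest()  # block-level granularity
        if key not in table:
            table[key] = block                   # first copy is stored
        refs.append(key)                         # duplicates become references
    return table, refs

table, refs = dedup([b"aaaa", b"bbbb", b"aaaa", b"aaaa"])
print(len(table), "unique blocks for", len(refs), "logical blocks")  # 2 for 4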
52. USENIX LISA10 November 7, 2010
Dedup Constraints
Size of the deduplication table
Quality of the checksums
Collisions happen
All possible permutations of N bits cannot be stored in N/10 bits
Checksums can be evaluated by probability of collisions
Multiple checksums can be used, but gains are marginal
Compression algorithms can work against deduplication
Dedup before or after compression?
52
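A hedged look at “collisions happen” (my own sketch): the birthday bound approximates the probability that any two of n distinct blocks collide under a b-bit checksum.

import math

# Birthday-bound approximation: p ~= 1 - exp(-n^2 / 2^(b+1))
def collision_probability(n_blocks, checksum_bits):
    return 1.0 - math.exp(-(n_blocks ** 2) / float(2 ** (checksum_bits + 1)))

# Assumed example: 2^32 blocks under a 256-bit vs. a 64-bit checksum
print(collision_probability(2 ** 32, 256))  # effectively zero
print(collision_probability(2 ** 32, 64))   # ~0.39, far from negligible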
53. USENIX LISA10 November 7, 2010
Verification
[Flowchart: write() -> compress -> checksum -> DDT entry lookup; if no DDT match, allocate a new entry; on a match, optionally verify by reading the stored data and comparing, then add a reference on a true match or allocate a new entry otherwise]
53
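A sketch of the verify step in that flow (my own simplified Python, not the ZFS implementation): on a checksum match, optionally read back the stored block and compare before adding a reference.

import hashlib

# ddt: checksum -> {"addr": block address, "refs": count}; store: list of blocks.
def dedup_write(block, ddt, store, verify=False):
    key = hashlib.sha256(block).hexdigest()      # checksum of the (compressed) block
    entry = ddt.get(key)
    if entry is not None:
        if verify and store[entry["addr"]] != block:
            store.append(block)                  # rare collision: write, don't dedup
            return len(store) - 1
        entry["refs"] += 1                       # match: just add a reference
        return entry["addr"]
    store.append(block)                          # no DDT match: new entry
    ddt[key] = {"addr": len(store) - 1, "refs": 1}
    return len(store) - 1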
54. USENIX LISA10 November 7, 2010
Reference Counts
54
Eggs courtesy of Richard’s chickens
56. USENIX LISA10 November 7, 2010
Replication Services
[Chart: replication services plotted by Recovery Point Objective (days, hours, seconds) against system I/O performance impact (slower to faster): traditional backup (NDMP, tar), file-level sync (rsync), object-level sync (databases, ZFS), block replication (DRBD, SNDR), application-level replication, and mirroring]
56
57. USENIX LISA10 November 7, 2010
How Many Copies Do You Need?
Answer: at least one, more is better...
One production, one backup
One production, one near-line, one backup
One production, one near-line, one backup, one at DR site
One production, one near-line, one backup, one at DR site, one
archived in a vault
RAID doesn’t count
Consider 3 to 4 as a minimum for important data
57
58. USENIX LISA10 November 7, 2010
Tiering Example
58
[Diagram: a big, honking disk array backed up to a big, honking tape library via file-based backup]
Works great, but...
59. USENIX LISA10 November 7, 2010
Tiering Example
59
[Diagram: the same disk array, tape library, and file-based backup; 10 million files, 1 million daily changes, 12-hour backup window]
... backups never complete
60. USENIX LISA10 November 7, 2010
Tiering Example
60
[Diagram: the disk array replicates hourly at the block level to a near-line backup tier, which feeds the big, honking tape library on a weekly backup window; 10 million files, 1 million daily changes]
Backups to near-line storage and tape have different policies
61. USENIX LISA10 November 7, 2010
Tiering Example
61
[Diagram: the same tiers - disk array, near-line backup, tape library]
Quick file restoration possible
62. USENIX LISA10 November 7, 2010
Application-Level Replication
Example
62
[Diagram: an application stores its data at three different sites (Site 1, Site 2, Site 3); one site serves as a long-term archive option]
64. USENIX LISA10 November 7, 2010
Reading Data Sheets Redux
Manufacturers publish useful data sheets and product guides
Reliability information
MTBF or AFR
UER, or equivalent
Warranty
Performance
Interface bandwidth
Sustained bandwidth (aka internal or media bandwidth)
Average rotational delay or rpm (HDD)
Average response or seek time
Native sector size
Environmentals
Power
64
AFR operating hours per year can be a footnote
66. USENIX LISA10 November 7, 2010
Key Points
66
You will need many copies of your data, get used to it
The cost/byte decreases faster than kicking old habits
Replication is a good thing, use often
Tiering is a good thing, use often
Beware of designing for success, design for failure, too
Reliability trumps availability
Space, dependability, performance: pick two