Techniques for Handling Huge Storage
Richard.Elling@RichardElling.com
USENIX LISA’10 Conference
November 8, 2010
Agenda
How did we get here?
When good data goes bad
Capacity, planning, and design
What comes next?
Note: this tutorial uses live demos, slides not so much
History
Milestones in Tape Evolution
1951 - magnetic tape for data storage
1964 - 9 track
1972 - Quarter Inch Cartridge (QIC)
1977 - Commodore Datasette
1984 - IBM 3480
1989 - DDS/DAT
1995 - IBM 3590
2000 - T9940
2000 - LTO
2006 - T10000
2008 - TS1130
Milestones in Disk Evolution
1954 - hard disk invented
1950s - Solid state disk invented
1981 - Shugart Associates System Interface (SASI)
1984 - Personal Computer Advanced Technology (PC/AT) Attachment, later shortened to ATA
1986 - “Small” Computer System Interface (SCSI)
1986 - Integrated Drive Electronics (IDE)
1994 - EIDE
1994 - Fibre Channel (FC)
1995 - Flash-based SSDs
2001 - Serial ATA (SATA)
2005 - Serial Attached SCSI (SAS)
Architectural Changes
Simple, parallel interfaces
Serial interfaces
Aggregated serial interfaces
When Good Data Goes Bad
Failure Rates
Mean Time Between Failures (MTBF)
Statistical interarrival error rate
Often cited in literature and data sheets
MTBF = total operating hours / total number of failures
Annualized Failure Rate (AFR)
AFR = operating hours per year / MTBF
Expressed as a percent
Example
MTBF = 1,200,000 hours
Year = 24 x 365 = 8,760 hours
AFR = 8,760 / 1,200,000 = 0.0073 = 0.73%
AFR is easier to grok than MTBF
Operating hours per year is a flexible definition
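The AFR arithmetic on this slide is easy to script. A minimal sketch in Python (the 24 x 365 operating-hours figure is the 24x7 assumption used above; other duty cycles change the result):

```python
def afr_percent(mtbf_hours, operating_hours_per_year=24 * 365):
    """Annualized Failure Rate (%) from MTBF, assuming 24x7 operation."""
    return operating_hours_per_year / mtbf_hours * 100

# Example from this slide: MTBF = 1,200,000 hours -> AFR = 0.73%
print(round(afr_percent(1_200_000), 2))
```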
Multiple Systems and Statistics
Consider 100 systems each with an MTBF = 1,000 hours
At time = 1,000 hours, about 100 failures are expected across the population
Not all systems will see one failure
[Chart: histogram of failures per system at 1,000 hours; x-axis: number of failures (0-4), y-axis: number of systems (0-40); the tails are labeled unlucky, very unlucky, and very, very unlucky. A simulation sketch follows below.]
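To see why 100 expected failures do not mean one failure per system, the scenario above can be simulated. A sketch assuming exponentially distributed times between failures (constant failure rate), using only the Python standard library:

```python
import random
from collections import Counter

MTBF = 1000.0      # hours
HORIZON = 1000.0   # observation window, hours
SYSTEMS = 100

def failures_in_window(mtbf, horizon):
    """Count failures for one repairable system with exponential interarrival times."""
    t, n = 0.0, 0
    while True:
        t += random.expovariate(1.0 / mtbf)
        if t > horizon:
            return n
        n += 1

random.seed(1)
counts = Counter(failures_in_window(MTBF, HORIZON) for _ in range(SYSTEMS))
for k in sorted(counts):
    print(f"{counts[k]:3d} systems saw {k} failure(s)")
```

The counts follow a Poisson distribution with mean 1: roughly a third of the systems see no failure at all, while a few unlucky ones see two or more.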
Failure Rates
MTBF is a summary metric
Manufacturers estimate MTBF by stressing many units for short periods of qualification time
Summary metrics hide useful information
Example: mortality study
Study mortality of children aged 5-14 during 1996-1998
Measured 20.8 per 100,000
MTBF = 4,807 years
Current world average life expectancy is 67.2 years
For large populations, such as huge disk farms, the summary MTBF can appear constant
The better question to answer: “is my failure rate increasing or decreasing?”
Why Do We Care?
Summary statistics, like MTBF or AFR, can be misleading or risky if we do not also distinguish between stable and trending processes
We need to analyze the ordered times between failures, relative to system age, to describe system reliability
Time Dependent Reliability
Useful for repairable systems
System can be repaired to satisfactory operation by any action
Failures occur sequentially in time
Measure the age of the components of a system
Need to distinguish age from interarrival times (time between failures)
Doesn’t have to be precise; a resolution of weeks works OK
Some devices report Power On Hours (POH)
SMART for disks
OSes
Clerical solutions or inventory asset systems work fine
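A mean cumulative failure curve, as used in the TDR examples on the following slides, can be computed from nothing more than per-system failure ages. A minimal sketch (the system names and ages are made-up illustrations):

```python
# Age (in months) at which each system experienced a failure
failure_ages = {
    "sys1": [3, 14, 15],
    "sys2": [7],
    "sys3": [],          # no failures observed yet
}

def mean_cumulative_failures(failure_ages, max_age):
    """Mean cumulative failures per system at each age, assuming every system
    has been observed for at least max_age."""
    n_systems = len(failure_ages)
    curve = []
    for age in range(1, max_age + 1):
        total = sum(1 for ages in failure_ages.values() for a in ages if a <= age)
        curve.append((age, total / n_systems))
    return curve

for age, mcf in mean_cumulative_failures(failure_ages, 18):
    print(age, round(mcf, 2))
```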
TDR Example 1
[Chart: mean cumulative failures (0-20) vs. system age in months (1-50) for Disk Sets A, B, and C, compared against a target MTBF line]
TDR Example 2
Did a common event occur?
[Chart: mean cumulative failures vs. system age in months for Disk Sets A, B, and C against the target MTBF line]
TDR Example 2.5
[Chart: mean cumulative failures vs. calendar date (Jan 1, 2010 to Feb 3, 2014)]
Long Term Storage
Near-line disk systems for backup
Access time and bandwidth advantages over tape
Enterprise-class tape for backup and archival
15-30 years shelf life
Significant ECC
Read error rate: 1e-20
Enterprise-class HDD read error rate: 1e-15
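Those error rates become more intuitive when converted into the probability of hitting at least one unrecoverable read while reading a device end to end. A hedged sketch (the 2 TB and 10 TB capacities are illustrative, not from this slide):

```python
import math

TB = 10**12

def p_unrecoverable(read_bytes, errors_per_bit):
    """Probability of at least one unrecoverable read while reading read_bytes
    (Poisson approximation)."""
    return 1.0 - math.exp(-read_bytes * 8 * errors_per_bit)

print(f"2 TB HDD,   UER 1e-15: {p_unrecoverable(2 * TB, 1e-15):.2%}")
print(f"10 TB tape, UER 1e-20: {p_unrecoverable(10 * TB, 1e-20):.2e}")
```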
Reliability
Reliability is time dependent
TDR analysis reveals trends
Use cumulative plots, mean cumulative plots, and recurrence rates
Graphs are good
Track failures and downtime by system versus age and calendar dates
Correlate anomalous behavior
Manage retirement, refresh, preventative processes using real data
Data Sheets
Reading Data Sheets
Manufacturers publish useful data sheets and product guides
Reliability information
MTBF or AFR
UER, or equivalent
Warranty
Performance
Interface bandwidth
Sustained bandwidth (aka internal or media bandwidth)
Average rotational delay or rpm (HDD)
Average response or seek time
Native sector size
Environmentals
Power
The operating hours per year assumed for AFR may be relegated to a footnote
Availability
Nines Matter
Is the Internet up?
Nines Matter
Is the Internet up?
Is the Internet down?
Nines Matter
Is the Internet up?
Is the Internet down?
Is the Internet’s reliability 5-9’s?
Nines Don’t Matter
Is the Internet up?
Is the Internet down?
Is the Internet’s reliability 5-9’s?
Do 5-9’s matter?
Reliability Matters!
Is the Internet up?
Is the Internet down?
Is the Internet’s reliability 5-9’s?
Do 5-9’s matter?
Reliability matters!
Designing for Failure
Change design perspective
Design for success
How to make it work?
What you learned in school: solve the equation
Can be difficult...
Design for failure
How to make it work when everything breaks?
What you learned in the army: win the war
Can be difficult... at first...
Example: Design for Success
[Diagram: two x86 servers running NexentaStor with the HA-Cluster plugin, both attached to shared storage over FC, SAS, or iSCSI]
Designing for Failure
Application-level replication
Hard to implement - coding required
Some activity in open community
Hard to apply to general purpose computing
Examples
DoD, Google, Facebook, Amazon, ...
The big guys
Tends to scale well with size
Multiple copies of data
Reliability - Availability
Reliability trumps availability
If disks didn’t break, RAID would not exist
If servers didn’t break, HA clusters would not exist
Reliability measured in probabilities
Availability measured in nines
Data Retention
Evaluating Data Retention
MTTDL = Mean Time To Data Loss
Note: MTBF is not constant in the real world, but assuming it is keeps the math simple
MTTDL[1] is a simple MTTDL model
No parity (single vdev, striping, RAID-0)
MTTDL[1] = MTBF / N
Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
Double Parity (3-way mirror, RAIDZ2, RAID-6)
MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
Triple Parity (4-way mirror, RAIDZ3)
MTTDL[1] = MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3)
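The MTTDL[1] formulas translate directly into a small calculator. A sketch (hours in, years out; the MTBF and MTTR values in the example are placeholders, not datasheet figures):

```python
HOURS_PER_YEAR = 24 * 365

def mttdl1_years(mtbf_hours, n_disks, parity=1, mttr_hours=24):
    """MTTDL[1] in years for an N-disk set that tolerates `parity` failed disks.
    parity=0 covers the no-parity (striping) case."""
    numerator = mtbf_hours ** (parity + 1)
    denominator = 1.0
    for i in range(parity + 1):
        denominator *= (n_disks - i)
    denominator *= mttr_hours ** parity
    return numerator / denominator / HOURS_PER_YEAR

# Example: 8-disk raidz2 (double parity), placeholder MTBF and MTTR
print(f"{mttdl1_years(1_200_000, 8, parity=2, mttr_hours=48):,.0f} years")
```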
Another MTTDL Model
The MTTDL[1] model doesn't take unrecoverable reads into account
But unrecoverable reads (UER) are becoming the dominant failure mode
UER is specified as errors per bits read
More bits = higher probability of loss per vdev
MTTDL[2] model considers UER
Why Worry about UER?
Richard's study
3,684 hosts with 12,204 LUNs
11.5% of all LUNs reported read errors
Bairavasundaram et al., FAST ’08
www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
1.53M LUNs over 41 months
RAID reconstruction discovers 8% of checksum mismatches
“For some drive models as many as 4% of drives develop checksum mismatches during the 17 months examined”
Manufacturers trade UER for space
Why Worry about UER?
RAID array study
[Chart: RAID array study results, annotated with unrecoverable reads and a disk-disappeared event from a “disk pull” test]
“Disk pull” tests aren’t very useful
MTTDL[2] Model
Probability that a reconstruction will fail
Precon_fail = (N-1) * size / UER
Model doesn't work for non-parity schemes
single vdev, striping, RAID-0
Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
MTTDL[2] = MTBF / (N * Precon_fail)
Double Parity (3-way mirror, RAIDZ2, RAID-6)
MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
Triple Parity (4-way mirror, RAIDZ3)
MTTDL[2] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2 * Precon_fail)
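MTTDL[2] follows the same pattern with the reconstruction-failure probability folded in. A sketch that interprets UER as a datasheet-style "one error per 10^15 bits read" figure, so that dividing by it matches the Precon_fail formula above; the values in the example are placeholders:

```python
HOURS_PER_YEAR = 24 * 365

def precon_fail(n_disks, disk_bytes, uer_bits_per_error):
    """Probability that a reconstruction hits an unrecoverable read.
    The simple model can exceed 1 for very large sets; no cap is applied here."""
    return (n_disks - 1) * disk_bytes * 8 / uer_bits_per_error

def mttdl2_years(mtbf_hours, n_disks, disk_bytes, uer, parity=1, mttr_hours=24):
    """MTTDL[2] in years; requires parity >= 1."""
    p = precon_fail(n_disks, disk_bytes, uer)
    numerator = mtbf_hours ** parity
    denominator = float(n_disks) * p
    for i in range(1, parity):
        denominator *= (n_disks - i) * mttr_hours
    return numerator / denominator / HOURS_PER_YEAR

# Example: 8 x 2 TB disks, single parity, UER of 1e15 bits per error (placeholders)
print(f"{mttdl2_years(1_200_000, 8, 2e12, 1e15, parity=1):,.0f} years")
```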
Practical View of MTTDL[1]
MTTDL[1] Comparison
MTTDL Models: Mirror
Spares are not always better...
MTTDL Models: RAIDZ2
Space, Dependability, and Performance
Dependability Use Case
Customer has 15+ TB of read-mostly data
16-slot, 3.5” drive chassis
2 TB HDDs
Option 1: one raidz2 set
24 TB available space
12 data
2 parity
2 hot spares, 48 hour disk replacement time
MTTDL[1] = 1,790,000 years
Option 2: two raidz2 sets
24 TB available space (total across both sets)
6 data
2 parity
no hot spares
MTTDL[1] = 7,450,000 years
Planning for Spares
Number of systems → need for spares
How many spares do you need?
How often do you plan replacements?
Replacing devices immediately becomes impractical
Not replacing devices increases risk, but how much?
There is no black-and-white answer; it depends... (a sizing sketch follows below)
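One hedged way to put numbers on these questions is to model failures within a replenishment interval as Poisson and size the spares pool for a target coverage probability. A sketch (the population, AFR, and interval are illustrative assumptions):

```python
import math

def p_enough_spares(n_devices, afr, interval_days, spares):
    """Probability that `spares` covers every failure in one replenishment
    interval, assuming failure counts are Poisson with rate n_devices * AFR."""
    lam = n_devices * afr * interval_days / 365.0
    return sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(spares + 1))

# 500 drives, 2% AFR, quarterly replenishment: how many spares for 99% coverage?
for spares in range(10):
    p = p_enough_spares(500, 0.02, 90, spares)
    print(spares, f"{p:.3f}")
    if p >= 0.99:
        break
```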
SparesOptimizer Demo
Capacity, Planning, and Design
Space
Space is a poor sizing metric, really!
Technology marketing heavily pushes space
Maximizing space can mean compromising performance AND reliability
As disks and tapes get bigger, they don’t get better
$150 rule
PHBs get all excited about space
Most current capacity planning tools manage by space
Bandwidth
Bandwidth constraints in modern systems are rare
Overprovisioning for bandwidth is relatively simple
Where to gain bandwidth can be tricky
Link aggregation
Ethernet
SAS
MPIO
Adding parallelism beyond 2 trades off reliability
Latency
Lower latency == better performance
Latency != IOPS
IOPS also achieved with parallelism
Parallelism only improves latency when latency is constrained by bandwidth
Latency = access time + transfer time
HDD
Access time limited by seek and rotate
Transfer time usually limited by media or internal bandwidth
SSD
Access time limited by architecture more than by c (the speed of light)
Transfer time limited by architecture and interface
Tape
Access time measured in seconds
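The access-plus-transfer split is easy to sanity-check with back-of-the-envelope numbers. A sketch with illustrative 7,200 rpm HDD figures (not taken from any particular datasheet):

```python
def hdd_io_latency_ms(seek_ms, rpm, xfer_bytes, media_mb_per_s):
    """Single-I/O latency: seek + average rotational delay + media transfer time."""
    rotational_ms = 0.5 * 60_000.0 / rpm              # half a revolution, on average
    transfer_ms = xfer_bytes / (media_mb_per_s * 1e6) * 1e3
    return seek_ms + rotational_ms + transfer_ms

# Illustrative 7,200 rpm drive: 8 ms average seek, 150 MB/s media bandwidth
print(f"{hdd_io_latency_ms(8.0, 7200, 128 * 1024, 150):.2f} ms for a 128 KiB read")
```

Access time (seek plus rotation) dominates the total for small I/Os, which is why random workloads are latency-bound rather than bandwidth-bound.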
Deduplication
What is Deduplication?
A $2.1 Billion feature
2009 buzzword of the year
Technique for improving storage space efficiency
Trades big I/Os for small I/Os
Does not eliminate I/O
Implementation styles
offline or post processing
data written to nonvolatile storage
process comes along later and dedupes data
example: tape archive dedup
inline
data is deduped as it is being allocated to nonvolatile storage
example: ZFS
Dedup how-to
Given a bunch of data
Find data that is duplicated
Build a lookup table of references to data
Replace duplicate data with a pointer to the entry in the lookup table
Granularity
file
block
byte
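A toy block-level version of the steps above, assuming fixed-size blocks and SHA-256 as the checksum (any strong hash would do); this is a sketch, not how any particular product lays out its table:

```python
import hashlib

BLOCK = 4096

def dedup(data: bytes):
    """Split data into fixed-size blocks, storing one copy per unique checksum."""
    table = {}    # checksum -> stored block (the lookup table)
    layout = []   # ordered checksums describing the original data
    for off in range(0, len(data), BLOCK):
        block = data[off:off + BLOCK]
        key = hashlib.sha256(block).hexdigest()
        table.setdefault(key, block)      # keep only the first copy
        layout.append(key)                # duplicate data becomes a reference
    return table, layout

def rehydrate(table, layout):
    return b"".join(table[key] for key in layout)

data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
table, layout = dedup(data)
assert rehydrate(table, layout) == data
print(f"{len(layout)} blocks written, {len(table)} unique blocks stored")
```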
Dedup Constraints
Size of the deduplication table
Quality of the checksums
Collisions happen
All possible values of N bits cannot be uniquely represented in N/10 bits
Checksums can be evaluated by probability of collisions
Multiple checksums can be used, but gains are marginal
Compression algorithms can work against deduplication
Dedup before or after compression?
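The collision risk can be bounded with the birthday approximation: for roughly n unique blocks and a b-bit checksum, the probability of any collision is about n(n-1) / 2^(b+1). A sketch (the block count is illustrative):

```python
def p_collision(n_blocks, checksum_bits):
    """Birthday-bound approximation of the probability of any checksum collision."""
    return n_blocks * (n_blocks - 1) / 2.0 / 2.0**checksum_bits

# About 2^32 unique 128 KiB blocks (roughly half a petabyte) with a 256-bit checksum
print(f"{p_collision(2**32, 256):.3e}")
```

With a strong 256-bit checksum the collision probability is negligible; with a short or weak checksum it is not, which is one reason verification (next slide) exists.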
Verification
[Flowchart: inline dedup write path with verification: write() → compress → checksum → DDT entry lookup; if the checksum matches an existing DDT entry and verify is enabled, read the stored data and compare; on a match, add a reference, otherwise create a new entry]
Reference Counts
Eggs courtesy of Richard’s chickens
Replication
Replication Services
[Diagram: replication services ranked by Recovery Point Objective (days → hours → seconds) and system I/O performance impact (slower → faster): traditional backup (NDMP, tar), file-level sync (rsync), object-level sync (databases, ZFS), block replication (DRBD, SNDR), application-level replication, mirror]
How Many Copies Do You Need?
Answer: at least one, more is better...
One production, one backup
One production, one near-line, one backup
One production, one near-line, one backup, one at DR site
One production, one near-line, one backup, one at DR site, one archived in a vault
RAID doesn’t count
Consider 3 to 4 as a minimum for important data
Tiering Example
[Diagram: a big, honking disk array backed up file-by-file to a big, honking tape library]
Works great, but...
Tiering Example
[Diagram: the same disk array and tape library; with 10 million files, 1 million daily changes, and a 12-hour backup window, the file-based backups never complete]
Tiering Example
[Diagram: hourly block-level replication from the disk array to a near-line backup tier, with a weekly backup window from the near-line tier to the tape library; backups to near-line storage and tape have different policies (10 million files, 1 million daily changes). A replication sketch follows below.]
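A hedged sketch of what the hourly block-level replication to the near-line tier could look like, using ZFS incremental snapshots driven from Python (the dataset names, host, and schedule are placeholders; ZFS send/receive is one possible mechanism, run from cron or a timer):

```python
import subprocess
from datetime import datetime, timezone

SRC = "tank/data"                  # placeholder: production dataset
DST_HOST = "nearline.example.com"  # placeholder: near-line backup host
DST = "backup/data"                # placeholder: destination dataset

def replicate(prev_snap=None):
    """Snapshot SRC and send it to the near-line host, incrementally if possible."""
    snap = f"{SRC}@repl-{datetime.now(timezone.utc):%Y%m%d%H%M}"
    subprocess.run(["zfs", "snapshot", snap], check=True)
    send_cmd = ["zfs", "send"] + (["-i", prev_snap] if prev_snap else []) + [snap]
    recv_cmd = ["ssh", DST_HOST, "zfs", "receive", "-F", DST]
    sender = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    subprocess.run(recv_cmd, stdin=sender.stdout, check=True)
    sender.stdout.close()
    if sender.wait() != 0:
        raise RuntimeError("zfs send failed")
    return snap   # pass back in as prev_snap on the next run
```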
Tiering Example
[Diagram: the same tiered layout; quick file restoration is possible from the near-line backup tier]
Application-Level Replication Example
[Diagram: an application storing copies of its data at Sites 1, 2, and 3, with a long-term archive option at one site]
Data Sheets
Reading Data Sheets Redux
Manufacturers publish useful data sheets and product guides
Reliability information
MTBF or AFR
UER, or equivalent
Warranty
Performance
Interface bandwidth
Sustained bandwidth (aka internal or media bandwidth)
Average rotational delay or rpm (HDD)
Average response or seek time
Native sector size
Environmentals
Power
The operating hours per year assumed for AFR may be relegated to a footnote
Summary
Key Points
You will need many copies of your data, get used to it
The cost per byte decreases faster than old habits can be kicked
Replication is a good thing; use it often
Tiering is a good thing; use it often
Beware of designing for success, design for failure, too
Reliability trumps availability
Space, dependability, performance: pick two
Thank You!
Questions?
Richard.Elling@RichardElling.com
Richard.Elling@Nexenta.com