These are the slides from a tutorial I presented at LOPSA-East in 2013. It covers spinning media and solid-state drives in detail.
A video of the presentation can be found on YouTube: http://www.youtube.com/watch?v=G3wf1HMr6b0
2. Class Overview
• The Evolution of Storage Technology
• Spinning Disks
• Storage Metrics
• Solid State Technology
Notes: Goals: a better understanding of spinning disks; an understanding of high- and low-level flash; issues with SSDs.
11. Disk Interface
• Removable (USB/CF)
• SATA I / II / III
• Nearline SAS
• SAS
• Fibre Channel
• PCI-e
12. Disk Interface: USB
• Spinning Disk
• Removable Media
Advantages:
• Nigh-universal
Disadvantages:
• Slower
• Fragile
• Easily lost
• Abstraction layer
13. Disk Interface: SATA I / II / III
• Speeds: 1.5 / 3 / 6Gb/s
• Requires AHCI for things like NCQ
• Subset of SAS
• Shares IDE command set
Notes: AHCI = Advanced Host Controller Interface (IDE mode is OK for TRIM). NCQ on an SSD ensures the SSD has work to do while the host is latent (Intel controllers can queue 32 requests). Logo from SATA-IO (the international standards organization).
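Those line rates use 8b/10b encoding, so usable throughput is roughly a tenth of the bit rate; a quick sanity check (my arithmetic, not from the slides):

```python
# SATA line rates use 8b/10b encoding: 10 bits on the wire per data byte,
# so usable throughput is about line_rate_gbps * 1000 / 10 MB/s.
for gen, gbps in (("I", 1.5), ("II", 3.0), ("III", 6.0)):
    print(f"SATA {gen}: {gbps} Gb/s -> ~{gbps * 1000 / 10:.0f} MB/s usable")
```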
14. Disk Interface: SATA 3.1
(This time, it’s personal)
• Approved July 2011
• Universal Storage Module
• mSATA
• QTRIM
Notes: QTRIM = queued TRIM commands; USM is a mobile drive standard.
15. SAS / Nearline SAS
• SAS
• Enhanced CRC checking
• 512/520/528-byte blocks
• Low density, high reliability
• Nearline SAS
• ...not so much
Notes: SAS = Serial Attached SCSI; 16 bits of CRC.
22. Disk Geometry: Logical Block Addressing
• First introduced as an abstraction layer
• Replaced CHS addressing
• Address Space is Linear (block 0 - n)
• Size of address space depends on the standard at time of manufacture.
Notes: Currently at 48-bit LBA (2^48 sectors, or 128 PiB with 512-byte sectors).
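The arithmetic behind that figure, as a quick check:

```python
# Maximum capacity addressable by 48-bit LBA with 512-byte sectors.
lba_bits = 48
sector_bytes = 512
max_bytes = (2 ** lba_bits) * sector_bytes   # 2^57 bytes
print(max_bytes / 2**50, "PiB")              # 128.0 PiB
```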
29. Spinning Disk Damage Vectors: Movement
• Movement vertical or parallel to platter
• Measured in G forces
• Head Crashes
• Spinning Down
• Head uses “Landing Strip”
• Repeated platter contact causes damage to the read/write head
Notes: You used to have to manually park the heads; putting your computer to sleep can cause the heads to park. The bumpy landing strip has a nanocoating.
30. Protection Against Movement
• "Active Drive Protection": free-fall sensor
• A lift arm lifts the head away from the platter
• Some protection systems are in the drive, some are in the controller
• Don't mix the two
Notes: Apple: Sudden Motion Sensor; Lenovo/IBM: Hard Drive Active Protection System. Next slide: "You know, vibrations are movement..."
33. Spinning Disk Damage Vectors: Shelf Life
• Oil / Lubricants in bearings
• Temperature fluctuations
• Magnetic “events” (bit rot)
• Outgassing / vapor removal
Notes: Long-term "archival quality" drives use long-life lubricant. Long-term "cold storage" arrays periodically spin up drives every few weeks to clean and scrub the data.
34. Spinning Disks in RAID
• Redundant Array of Inexpensive Disks
• Common RAID levels:
• 0,1,5,6,10
• Software / Hardware
35. Spinning Disks in RAID: Important Considerations
• Redundancy
• Capacity
• Speed
• Robustness
Notes: Speed: dedicated hardware? single point of failure? parity calculation? how long to rebuild a drive? NUMBER OF SPINDLES! Redundancy: how many drive failures can you survive? UREs (unrecoverable read errors)? Capacity: parity stripe or mirrors? (Harder, better, faster, stronger.)
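To make the capacity trade-off concrete, a small sketch; the drive count and size are arbitrary examples, and RAID 1 here models a single n-way mirror set:

```python
# Usable capacity for common RAID levels: n identical drives of size c.
def usable(level, n, c):
    return {"0": n * c,              # striping: no redundancy
            "1": c,                  # n-way mirror: one drive's worth
            "5": (n - 1) * c,        # one parity drive's worth lost
            "6": (n - 2) * c,        # two parity drives' worth lost
            "10": n * c / 2}[level]  # striped mirrors

for lvl in ("0", "1", "5", "6", "10"):
    print(f"RAID {lvl}: {usable(lvl, 8, 4)} TB usable from 8 x 4 TB")
```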
39. IOPS
• What are they?
• What aren’t they?
40. The Simplified Equation
IOPS = 1 / (((R + W) / 2 + L) / 1000)
R = Average Read Time
W = Average Write Time
L = Average Latency (all times in milliseconds)
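A minimal check of the formula in Python; the drive figures below are illustrative assumptions for a 7,200 RPM disk, not measurements:

```python
# Simplified per-drive IOPS estimate from the equation above (times in ms).
def iops(avg_read_ms, avg_write_ms, avg_latency_ms):
    service_time_ms = (avg_read_ms + avg_write_ms) / 2 + avg_latency_ms
    return 1 / (service_time_ms / 1000)   # ms -> seconds, then invert

# ~8.5/9.5 ms average read/write seek, ~4.16 ms rotational latency
# (half a revolution at 7,200 RPM).
print(round(iops(8.5, 9.5, 4.16)))        # ~76 IOPS
```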
45. NOR Flash
• Reads and writes are atomic, single-bit operations
• Expensive
• Small specific use cases
Notes: We won't talk about NOR much.
46. NAND Flash
• Reads are based on “read blocks” (4k)
• Writes are based on “erasure blocks”
• Cheap (and getting cheaper)
• Broad use cases
47. Read / Write Profiles
• Logical addresses abstracted from LBA
• No seek time
• Reads are generally very fast
• Writes are typically slower
Notes: Random and linear IO have identical access times. Next slide: the magic.
49. Quantum Tunneling
[Equation: transmission coefficient for a particle tunneling through a single potential barrier]
Notes: Hot-carrier injection. Storage/erase uses Fowler-Nordheim tunnel injection/release.
50. Doped Silicon
Single-Level Cell (SLC)
Multi-Level Cell (MLC)
Triple-Level Cell (TLC)
Notes: Charge pumps are used to push electrons through the barrier. Each charge level maps to a binary state: 1 or 0.
51. Gradual Destruction
• Multiple cells need multiple writes
• Energy increases with cell levels
• The barrier accumulates electrons
Notes: The electrical potential difference between the barrier and the cells disappears.
53. Density
• 3-Dimensional
• Charge levels
• Size of cells
• “Dot Pitch” (Cells Per Inch)
• 5nm, 3nm, 2nm
• Varies with “level” count
54. SLC / ESLC
• Low Density
• Single (bit) Level Cell
• Quick: 25µs Read / 200-300µs Write
• More robust & long wear time
• Write endurance near 100,000 cycles
Notes: Expensive per unit of capacity; only available in 5nm/3nm densities.
55. MLC / EMLC
• Reasonably High Density
• Two (bit) Level Cell
• Decently fast: 50µs Read / 600-900µs Write
• Medium lifetime
• Write endurance near 3,000 cycles
56. TLC
• Very High Density
• Three (bit) Level Cell
• Decently fast: 75µs Read / 900-1350µs Write
• Not very robust or durable :-(
• Write endurance ~ 1,000 cycles
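To put these endurance numbers in perspective, a rough lifetime estimate; every figure below is an assumption for illustration (a hypothetical 256 GB MLC drive):

```python
# Rough drive-lifetime estimate from write endurance (all figures assumed).
capacity_gb = 256
pe_cycles = 3_000            # MLC write endurance from the slide above
write_amplification = 2      # controller overhead: host write -> flash writes
host_writes_gb_per_day = 50

total_flash_writes_gb = capacity_gb * pe_cycles
days = total_flash_writes_gb / (host_writes_gb_per_day * write_amplification)
print(f"~{days / 365:.0f} years")   # ~21 years under these assumptions
```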
70. Remember:
• Spinning Disks
• Linear is fast
• Random is slow
• Reads marginally faster than writes (sometimes)
Notes: Writes are slower when switching tracks.
71. With SSDs:
• Reads are fast
• Writes are slow(ish)
• Random or linear doesn’t matter (as much)
72. SSD Performance Overview
Depends on:
• Number of flash chips in use
• Number of busses from the processor
• Performance of controller CPU
• Contention
• Bus speed
• Number of erasure blocks used
• Number of previous writes to flash cells
73.
• Chips
• IO Busses
• CPU Cores
74. Causes of Contention
• Legitimate use
• Garbage collection
• Legitimate (but latent) usage
• IO Blender!
(Bender Blender: http://bit.ly/10vc7Sf)
Notes: Latent usage: updatedb? atime? app-level garbage collection? (T-shirt at Threadless.)
75. Bus Speed
• SATA - 3 or 6 Gb/s?
• IOPS calculation (per the simplified equation above)
• Can your controller handle your disks?
76. Read
• Very fast
• No seek time
• Moderately improved over spinning disk for linear IO; greatly improved for random
• Causes no damage to the media
• Generally scales up with capacity
77. Write
• Usually fast (depending on drive usage)
• No seek time
• Highly improved over spinning disk
• Causes gradual wear to the media
• Generally scales up with capacity
88. Flash Chips
Notes: If individual chip capacity is finite, how do bigger drives increase capacity? What does this mean for performance?
89. Flash Controllers
• Flash Translation Layer (FTL)
• Stripe Writes
• Interpret bus instructions
• Wear Leveling
• Garbage Collection
Notes: Controllers do the heavy lifting; the FTL is, without a doubt, the single largest problem with flash drives.
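To show the core FTL idea in miniature: writes always land on a fresh physical page and the logical-to-physical map is updated, so a rewrite never touches the old page in place. A toy sketch, with all names invented for illustration:

```python
# Toy flash translation layer: out-of-place writes plus a crude
# round-robin free list, which doubles as naive wear leveling.
class ToyFTL:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # physical pages ready to program
        self.map = {}                       # logical page -> physical page
        self.wear = [0] * num_pages         # program count per physical page

    def write(self, logical, data):         # data payload ignored in this sketch
        phys = self.free.pop(0)             # always write out of place
        self.wear[phys] += 1
        old = self.map.get(logical)
        if old is not None:
            self.free.append(old)           # stale page; real GC must erase it
        self.map[logical] = phys

ftl = ToyFTL(num_pages=8)
for _ in range(3):
    ftl.write(0, b"x")                      # rewrite logical page 0 three times
print(ftl.map, ftl.wear)                    # the physical location keeps moving
```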
92. Longevity
• Primarily determined by the class of flash
• (e)SLC, (e)MLC, TLC
• Related to wear-leveling
• Under-reported capacity
• Short-stroking improves lifetime (not speed)
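On under-reported capacity: the gap between raw flash and advertised capacity is the controller's over-provisioning pool. A back-of-the-envelope example; the 256 GiB raw figure is an assumption, not a spec:

```python
# Over-provisioning: raw flash minus user-visible capacity.
raw_bytes = 256 * 2**30        # assume 256 GiB of raw NAND on board
user_bytes = 240 * 10**9       # drive advertised as "240 GB" (decimal units)
op = (raw_bytes - user_bytes) / user_bytes
print(f"{op:.1%} over-provisioning")   # ~14.5%
```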
93. Partition Alignment
• Performance and longevity
• As big an issue as (or bigger than) it was with spinning disks
• Native 4k read blocks
• Far larger erasure blocks
• Larger than is practical for alignment
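A quick arithmetic check for alignment; the start sector below is the common 1 MiB default from modern partitioning tools:

```python
# Does a partition's starting LBA fall on a 4 KiB boundary?
SECTOR_BYTES = 512
start_sector = 2048                        # 1 MiB offset, a typical default
offset_bytes = start_sector * SECTOR_BYTES
print(offset_bytes % 4096 == 0)            # True: 4 KiB-aligned
```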
94. TRIM
• As a command, refers to ATA-8 spec
• SCSI equivalent is UNMAP, but both are often referred to as TRIM.
• Does not immediately delete unused blocks
• Allows for GC
Notes: Linux calls this "discard"; TRIM refers specifically to the ATA command.
95. Linux TRIM Support
• EXT4 / XFS / JFS / BTRFS: native using the 'discard' mount option
• Consider NOOP or Deadline IO scheduler
• fstrim (part of util-linux) for R/W vols
• zerofree for R/O vols
Notes: fstrim and zerofree are userland tools; important for thin-provisioned volumes on SAN arrays which support it. Check the docs on schedulers for details; deadline prefers read queues (tunables under /sys/block).
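Whether a block device advertises discard support can be read from sysfs; a small sketch, assuming a Linux system with sysfs mounted:

```python
# List block devices and whether they report TRIM/discard support.
# A nonzero discard_granularity generally means discards are accepted.
from pathlib import Path

for f in sorted(Path("/sys/block").glob("*/queue/discard_granularity")):
    dev = f.parts[3]                       # e.g. "sda"
    supported = int(f.read_text()) > 0
    print(f"{dev}: discard {'supported' if supported else 'not supported'}")
```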
96. OS X TRIM Support
• Enabled by default on factory-installed SSDs
• Trim-Enabler
• http://www.groths.org/trim-enabler/
97. ZFS and SSDs
• ZFS Intent Log (ZIL)
• Adaptive Replacement Cache (ARC)
• arc_summary can help you decide
Notes: The ZIL is almost like a journal. The ARC is a RAM cache with disk backing it; SSDs can serve as L2ARC. https://code.google.com/p/jhell/wiki/arc_summary
98. Filesystems in General
• Standard journaling filesystems
• Mount options (atime/relatime, etc.), /tmp -> tmpfs
• Next-Gen:
• ZFS / BTRFS
• Distributed filesystems
• DRBD
Notes: ZFS can use an SSD cache pool. ZFS and BTRFS are copy-on-write. DRBD has no TRIM support.
99. Monitor Health w/ S.M.A.R.T.
• S.M.A.R.T. information
• vendor-specific
• Includes flash erase count
• smartctl on Linux and Mac
• Dozens of tools on Windows (check wiki)
100. Forensics
"...Our results lead to three conclusions: First, built-in commands are effective, but manufacturers sometimes implement them incorrectly. Second, overwriting the entire visible address space of an SSD twice is usually, but not always, sufficient to sanitize the drive. Third, none of the existing hard drive-oriented techniques for individual file sanitization are effective on SSDs."
Source: Reliably Erasing Data From Flash-Based Solid State Drives, by Michael Wei, Laura M. Grupp, Frederick E. Spada, and Steven Swanson (Department of Computer Science and Engineering and Center for Magnetic Recording and Research, University of California, San Diego). http://bit.ly/fast11-wei-paper
102. Hardware / Software
Hardware RAID controllers:
• Dedicated CPU power
• Battery-backed storage
• Commercial support
• Proprietary tech
Software RAID:
• Trust (eyes on code)
• Portability
• Avoids the excessive cost of HW
• Spare CPU cycles
Notes: A hardware controller is a single point of failure.
103. TRIM / GC?
• Does the RAID software/device know enough to pass along TRIM?
• Will the array eventually crawl because of ongoing GC issues?
Notes: No software RAID that I know of supports it; Intel's chipset RAID supports TRIM for RAID 0.
104. Access Bandwidth
• How much data can a single drive transmit?
• How many drives are in the array?
• What is the aggregate bus speed to the array controller?
• What is the bus speed to the host(s)?
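These questions reduce to comparing two numbers; a back-of-the-envelope sketch where every figure is an illustrative assumption:

```python
# Can the bus feed the drives? Compare aggregate drive throughput to
# what the link to the controller can carry (all numbers assumed).
drives = 8
per_drive_mb_s = 180                 # sustained MB/s for one spinning disk
aggregate_mb_s = drives * per_drive_mb_s

link_mb_s = 6_000 / 10               # 6 Gb/s lane, 8b/10b -> ~600 MB/s
lanes = 4                            # e.g. a 4-lane wide port
bus_mb_s = link_mb_s * lanes

print(f"drives: {aggregate_mb_s} MB/s, bus: {bus_mb_s:.0f} MB/s")
print("bus is the bottleneck" if bus_mb_s < aggregate_mb_s
      else "drives are the bottleneck")
```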
106. Controller / Bus
• Speed / Ports
• How mature / reliable / tested?
Remember me?
Notes: Just because buses exist in a storage array doesn't make them magic and infinite in size.
107. Tiering / Caching
• Very fast SDRAM
• SSD tier: hot blocks
• Faster spinning disks
• Very slow, cheap disks