ZFS combines a file system, volume manager, and RAID controller in one. It uses a copy-on-write design and checksums all data for integrity. Its advantages include speed, simple management, self-healing, and built-in features such as snapshots and sharing. ZFS achieves these feats through a layered architecture (ZPL, DMU, and SPA) that handles I/O, transactions, block allocation, and integrity protection.
1. ZFS Nuts and Bolts
Eric Sproul
OmniTI Computer Consulting
2. Quick Overview
• More than just another filesystem: it’s a filesystem,
a volume manager, and a RAID controller all in one
• Production debut in Solaris 10 6/06
• 128-bit (1 ZB = 1 billion TB)
• 2^64 snapshots, 2^48 files/directory,
2^64 bytes/filesystem, 2^78 bytes/pool,
2^64 devices/pool, 2^64 pools/system
3. Old & Busted
Traditional storage stack:
filesystem (upper): filename to object (inode)
filesystem (lower): object to volume LBA
volume manager: volume LBA to array LBA
RAID controller: array LBA to disk LBA
• Strict separation between layers
• Each layer often comes from separate vendors
• Complex, difficult to administer, hard to predict
performance of a particular combination
4. New Hotness
• Telescoped stack:
ZPL: filename to object
DMU: object to DVA
SPA: DVA to disk LBA
• Terms:
• ZPL: ZFS POSIX layer (standard syscall interface)
• DMU: Data Management Unit (transactional object store)
• DVA: Data Virtual Address (vdev + offset)
• SPA: Storage Pool Allocator (block allocation, data
transformation)
5. New Hotness
• No more separate tools to manage filesystems vs.
volumes vs. RAID arrays
• 2 commands: zpool(1M), zfs(1M) (RFE exists to combine these)
• Pooled storage means never getting stuck with too
much or too little space in your filesystems
• Can expose block devices as well; “zvol” blocks
map directly to DVAs
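A minimal sketch of exposing a zvol, assuming a pool named tank (pool, volume name, and size here are illustrative):
# zfs create -V 10g tank/myvol
# ls /dev/zvol/dsk/tank/myvol   (the zvol appears as an ordinary block device)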
6. ZFS Advantages
• Fast
• copy-on-write, pipelined I/O, dynamic striping,
variable block size, intelligent resilvering
• Simple management
• End-to-end data integrity, self-healing
• Checksum everything, all the time
• Built-in goodies
• block transforms
• snapshots
• NFS, CIFS, iSCSI sharing
• Platform-neutral on-disk format
7. Getting Down to Brass Tacks
How does ZFS achieve these feats?
8. ZFS I/O Life Cycle
Writes
1. Translated to object transactions by the ZPL:
“Make these 5 changes to these 2 objects.”
2. Transactions bundled in DMU into transaction
groups (TXGs) that flush when full (1/8 of system
memory) or at regular intervals (30 seconds)
3. Blocks making up a TXG are transformed (if
necessary), scheduled and then issued to physical
media in the SPA
9. ZFS I/O Life Cycle
Synchronous Writes
• ZFS maintains a per-filesystem log called the ZFS
Intent Log (ZIL). Each transaction gets a log
sequence number.
• When a synchronous command, such as fsync(), is
issued, the ZIL commits blocks up to the current
sequence number. This is a blocking operation.
• The ZIL commits all necessary operations and
flushes any write caches that may be enabled,
ensuring that all bits have been committed to stable
storage.
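For sync-heavy workloads, the ZIL can be placed on a dedicated log device; a sketch, with a hypothetical device name:
# zpool add tank log c4t0d0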
10. ZFS I/O Life Cycle
Reads
• ZFS makes heavy use of caching and prefetching
• If requested blocks are not cached, issue a
prioritized I/O that “cuts the line” ahead of pending
writes
• Writes are intelligently throttled to maintain
acceptable read performance
• ARC (Adaptive Replacement Cache) tracks recently
and frequently used blocks in main memory
• L2 ARC uses durable storage to extend the ARC
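A sketch of extending the ARC with an L2 ARC device (device name hypothetical; a fast SSD is the typical choice):
# zpool add tank cache c4t1d0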
11. Speed Is Life
• Copy-on-write design means random writes can
be made sequential
• Pipelined I/O extracts maximum parallelism with
out-of-order issue, sorting and aggregation
• Dynamic striping across all underlying devices
eliminates hot-spots
• Variable block size = no wasted space or effort
• Intelligent resilvering copies only live data, can do
partial rebuild for transient outages
15. Copy-On-Write
Atomically update uberblock to point at updated blocks
The uberblock is special in that it does get overwritten, but 4
copies are stored as part of the vdev label and are updated in
transactional pairs. Therefore, integrity on disk is maintained.
16. Pipelined I/O
Reorders writes to be as sequential as possible
[Animation: App #1 and App #2 each issue writes that land at scattered locations on the platter]
If left in original order, we waste a lot of time waiting for head and platter positioning: move head, spin wait, move head, move head, move head.
Pipelining lets us examine writes as a group and optimize order: move head, move head.
27. Dynamic Striping
• Load distribution across top-level vdevs
• Factors determining block allocation
include:
• Capacity
• Latency & bandwidth
• Device health
28. Dynamic Striping
# zpool create tank mirror c1t0d0 c1t1d0 mirror c2t0d0 c2t1d0
Writes striped across both mirrors. Reads occur wherever data was written.
# zpool add tank mirror c3t0d0 c3t1d0
New data striped across all three mirrors. No migration of existing data. Copy-on-write reallocates data over time, gradually spreading it across all three mirrors.
* RFE for “on-demand” resilvering to explicitly re-balance
29. Variable Block Size
• No single value works well with all types of files
• Large blocks increase bandwidth but reduce metadata and can lead to
wasted space
• Small blocks save space for smaller files, but increase I/O operations on
larger ones
• Record-based files such as those used by databases have a fixed block
size that must be matched by the filesystem to avoid extra overhead
(blocks too small) or read-modify-write (blocks too large)
30. Variable Block Size
• The DMU operates on units of a fixed record size;
default is 128KB
• Files that are less than the record size are written as
a single filesystem block (FSB) of variable size in
multiples of disk sectors (512B)
• Files that are larger than the record size are stored
in multiple FSBs equal to record size
• DMU records are assembled into transaction groups
and committed atomically
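For example, a database with fixed 8 KB pages can have the record size matched to it to avoid read-modify-write (dataset name hypothetical):
# zfs set recordsize=8k tank/db
# zfs get recordsize tank/db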
31. Variable Block Size
• FSBs are the basic unit of ZFS datasets; checksums
are maintained per FSB
• Handled by the SPA, which can optionally transform
them (compression, ditto blocks today; encryption,
de-dupe in the future)
• Compression improves I/O performance, as fewer
operations are needed on the underlying disk
32. Intelligent Resilver
• a.k.a. rebuild, resync, reconstruct
• Traditional resilvering is basically a whole-disk copy
in the mirror case; RAID-5 does XOR of the other
disks to rebuild
• No priority given to more important blocks
(top of the tree)
• If you’ve copied 99% of the blocks, but the last
1% contains the top few blocks in the tree,
another failure ruins everything
33. Intelligent Resilver
• The ZFS way is metadata-driven
• Live blocks only: just walk the block tree;
unallocated blocks are ignored
• Top-down: Start with the most important blocks.
Every block copied increases the amount of
discoverable data.
• Transactional pruning: If the failure is transient,
repair by identifying the missed TXGs. Resilver
time is only slightly longer than the outage time.
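A resilver is triggered by replacing a device and can be monitored as it runs; a sketch with hypothetical device names:
# zpool replace tank c1t0d0 c1t2d0
# zpool status tank   (shows resilver progress)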
34. Keep It Simple
• Unified management model: pools and datasets
• Datasets are just a group of tagged bits with
certain attributes: filesystems, volumes, snapshots,
clones
• Properties can be set while the dataset is active
• Hierarchical arrangement: children inherit
properties of parent
• Datasets become administration points-- give
every user or application their own filesystem
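A sketch of per-user datasets (names hypothetical); children inherit properties from tank/home:
# zfs create tank/home
# zfs create tank/home/alice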
35. Keep It Simple
• Datasets only occupy as much space as they need
• Compression, quotas and reservations are built-in
properties
• Pools may be grown dynamically without service
interruption
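A sketch of these built-in properties and of online pool growth (names and sizes illustrative):
# zfs set quota=10g tank/home/alice        (cap space usage)
# zfs set reservation=2g tank/home/alice   (guarantee space)
# zpool add tank mirror c4t0d0 c4t1d0      (grow the pool with no downtime)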
36. Data Integrity
• Not enough to be fast and simple; must be
safe too
• Silent corruption is our mortal enemy
• Defects can occur anywhere: disks, firmware, cables, kernel drivers
• Main memory has ECC; why shouldn’t storage have something similar?
• Other types of corruption are also killers:
• Power outages, accidental overwrites, using a disk as swap
37. Data Integrity
Traditional Method: Disk Block Checksum
[Diagram: checksum stored alongside the data in the same disk block]
• Only detects problems after data is successfully written (“bit rot”)
• Won’t catch silent corruption caused by issues in the I/O path between disk and host
40. Data Integrity
The ZFS Way
[Diagram: a tree of block pointers; each ptr holds the checksum of the child block it references]
• Store data checksum in parent block pointer
• Isolates faults between checksum and data
• Forms a hash tree, enabling validation of the entire pool
• 256-bit checksums
• fletcher2 (default, simple and fast) or SHA-256 (slower, more secure)
• Can be validated at any time with ‘zpool scrub’
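The checksum algorithm is a per-dataset property, and a scrub validates every live block in the pool; the dataset name here is illustrative:
# zfs set checksum=sha256 tank/myfs
# zpool scrub tank
# zpool status tank   (reports scrub progress and any checksum errors)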
46. Data Integrity
[Animation: an application reads through a ZFS mirror; ZFS detects a checksum mismatch on one copy, returns the good copy to the application, and repairs the damaged copy]
Self-healing mirror!
47. Goodie Bag
• Block Transforms
• Snapshots & Clones
• Sharing (NFS, CIFS, iSCSI)
• Platform-neutral on-disk format
48. Block Transforms
• Handled at SPA layer, transparent to upper layers
• Available today:
• Compression
• zfs set compression=on tank/myfs
• LZJB (default) or GZIP
• Multi-threaded as of snv_79
• Duplication, a.k.a. “ditto blocks”
• zfs set copies=N tank/myfs
• In addition to mirroring/RAID-Z: One logical block = up to 3
physical blocks
• Metadata always has 2+ copies, even without ditto blocks
• Copies stored on different devices, or different places on same
device
• Future: de-duplication, encryption
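For example, GZIP levels can be selected explicitly and the achieved ratio inspected afterward (dataset name illustrative):
# zfs set compression=gzip-9 tank/myfs   (gzip-1 through gzip-9; plain “gzip” means gzip-6)
# zfs get compressratio tank/myfs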
49. Snapshots & Clones
• zfs snapshot tank/myfs@thursday
• Based on block birth time, stored in block pointer
• Nearly instantaneous (<1 sec) on idle system
• Communicates structure, since it is based on
object changes, not just a block delta
• Occupies negligible space initially, and only grows
as large as the block changeset
• Clone is just a read/write snapshot
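A sketch of the snapshot/clone workflow (dataset names hypothetical):
# zfs snapshot tank/myfs@thursday
# zfs clone tank/myfs@thursday tank/myfs-dev   (writable clone sharing unchanged blocks)
# zfs rollback tank/myfs@thursday              (discard changes made since the snapshot)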
50. Sharing
• NFSv4
• zfs set sharenfs=on tank/myfs
• Automatically updates /etc/dfs/sharetab
• CIFS
• zfs set sharesmb=on tank/myfs
• Additional properties control the share name and workgroup
• Supports full NT ACLs and user mapping, not just POSIX uid
• iSCSI
• zfs set shareiscsi=on tank/myvol
• Makes sharing block devices as easy as sharing filesystems
51. On-Disk Format
• Platform-neutral, adaptive endianness
• Writes always use native endianness, recorded in a bit in the block
pointer
• Reads byteswap if necessary, based on comparison of host endianness to
value of block pointer bit
• Migrate between x86 and SPARC
• No worries about device paths, fstab, mountpoints, it all just works
• ‘zpool export’ on old host, move disks, ‘zpool import’ on new host
• Also migrate between Solaris and non-Sun implementations, such as
Mac OS X and FreeBSD
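The migration amounts to two commands (pool name illustrative):
old-host# zpool export tank
(physically move the disks)
new-host# zpool import tank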