ZFS Nuts and Bolts
          Eric Sproul
  OmniTI Computer Consulting
Quick Overview
•   More than just another filesystem: it’s a filesystem,
    a volume manager, and a RAID controller all in one

•   Production debut in Solaris 10 6/06

•   1 ZB = 1 billion TB

•   128-bit

•   2^64 snapshots, 2^48 files/directory,
    2^64 bytes/filesystem, 2^78 bytes/pool,
    2^64 devices/pool, 2^64 pools/system
Old & Busted
Traditional storage stack:
  filesystem(upper): filename to object (inode)
  filesystem(lower): object to volume LBA
  volume manager: volume LBA to array LBA
  RAID controller: array LBA to disk LBA

• Strict separation between layers
• Each layer often comes from separate vendors
• Complex, difficult to administer, hard to predict
 performance of a particular combination
New Hotness
•   Telescoped stack:
        ZPL: filename to object
        DMU: object to DVA
        SPA: DVA to disk LBA
•   Terms:

    •   ZPL: ZFS POSIX layer (standard syscall interface)

    •   DMU: Data Management Unit (transactional object store)

    •   DVA: Data Virtual Address (vdev + offset)

    •   SPA: Storage Pool Allocator (block allocation, data
        transformation)
New Hotness

•   No more separate tools to manage filesystems vs.
    volumes vs. RAID arrays
    •   2 commands: zpool(1M), zfs(1M) (RFE exists to combine these)

•   Pooled storage means never getting stuck with too
    much or too little space in your filesystems

•   Can expose block devices as well; “zvol” blocks
    map directly to DVAs
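
A minimal sketch of that unified model, assuming two spare disks (device names here are illustrative):

    # zpool create tank mirror c1t0d0 c1t1d0    (pool + root filesystem “tank”)
    # zfs create tank/home                      (child filesystem, space drawn from the pool)
    # zfs create -V 10g tank/vol1               (zvol: block device at /dev/zvol/dsk/tank/vol1)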
ZFS Advantages
•   Fast
    •   copy-on-write, pipelined I/O, dynamic striping,
        variable block size, intelligent resilvering


•   Simple management

•   End-to-end data integrity, self-healing
    •   Checksum everything, all the time

•   Built-in goodies
    •   block transforms

    •   snapshots

    •   NFS, CIFS, iSCSI sharing

    •   Platform-neutral on-disk format
Getting Down to Brass Tacks



 How does ZFS achieve these feats?
ZFS I/O Life Cycle
                       Writes
1. Translated to object transactions by the ZPL:
   “Make these 5 changes to these 2 objects.”
2. Transactions bundled in DMU into transaction
   groups (TXGs) that flush when full (1/8 of system
   memory) or at regular intervals (30 seconds)
3. Blocks making up a TXG are transformed (if
   necessary), scheduled and then issued to physical
   media in the SPA
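
On OpenSolaris-era systems the flush interval is a kernel tunable; a hedged illustration for /etc/system (the variable name assumes that era’s implementation):

    set zfs:zfs_txg_timeout = 30    (seconds between forced TXG syncs)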
ZFS I/O Life Cycle
                 Synchronous Writes
•   ZFS maintains a per-filesystem log called the ZFS
    Intent Log (ZIL). Each transaction gets a log
    sequence number.
•   When a synchronous command, such as fsync(), is
    issued, the ZIL commits blocks up to the current
    sequence number. This is a blocking operation.
•   The ZIL commits all necessary operations and
    flushes any write caches that may be enabled,
    ensuring that all bits have been committed to stable
    storage.
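
Because the ZIL gates synchronous-write latency, it can be placed on a dedicated fast device (a “slog”). A sketch, assuming spare devices c4t0d0 and c4t1d0:

    # zpool add tank log c4t0d0                  (dedicate a device to the ZIL)
    # zpool add tank log mirror c4t0d0 c4t1d0    (or mirror the log)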
ZFS I/O Life Cycle
                          Reads
•   ZFS makes heavy use of caching and prefetching
•   If requested blocks are not cached, issue a
    prioritized I/O that “cuts the line” ahead of pending
    writes
•   Writes are intelligently throttled to maintain
    acceptable read performance
•   ARC (Adaptive Replacement Cache) tracks recently
    and frequently used blocks in main memory
•   L2 ARC uses durable storage to extend the ARC
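
A sketch of extending the cache, assuming a spare SSD at c4t2d0; the kstat module/name pair is from the OpenSolaris arcstats implementation:

    # zpool add tank cache c4t2d0    (add an L2ARC device)
    # kstat -m zfs -n arcstats       (inspect ARC hit/miss counters)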
Speed Is Life
•   Copy-on-write design means random writes can
    be made sequential

•   Pipelined I/O extracts maximum parallelism with
    out-of-order issue, sorting and aggregation

•   Dynamic striping across all underlying devices
    eliminates hot-spots

•   Variable block size = no wasted space or effort

•   Intelligent resilvering copies only live data, can do
    partial rebuild for transient outages
Copy-On-Write




   Initial block tree
Copy-On-Write




New blocks represent changes
 Never modifies existing data
Copy-On-Write




 Indirect blocks also change
Copy-On-Write




Atomically update uberblock to point at updated blocks
          The uberblock is special in that it does get overwritten, but 4
          copies are stored as part of the vdev label and are updated in
          transactional pairs. Therefore, integrity on disk is maintained.
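
Those label copies can be inspected directly with zdb; a sketch, assuming the pool occupies c1t0d0s0:

    # zdb -l /dev/dsk/c1t0d0s0    (dump the four vdev label copies)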
Pipelined I/O
       Reorders writes to be as sequential as possible

App #1 writes:   (figure not shown: interleaved write
App #2 writes:    requests from two applications)

If left in original order, we waste a lot of time waiting
for head and platter positioning:

    Move    Spin    Move    Move    Move
    head    wait    head    head    head

Pipelining lets us examine writes as a group and
optimize order:

    Move    Move
    head    head
Dynamic Striping
• Load distribution across top-level vdevs
• Factors determining block allocation
  include:
  •   Capacity

  •   Latency & bandwidth

  •   Device health
Dynamic Striping
Before: two mirrors

    # zpool create tank \
        mirror c1t0d0 c1t1d0 \
        mirror c2t0d0 c2t1d0

    Writes striped across both mirrors.
    Reads occur wherever data was written.

After: add a third mirror

    # zpool add tank mirror c3t0d0 c3t1d0

    New data striped across three mirrors. No migration of existing data.
    Copy-on-write reallocates data over time, gradually spreading it
    across all three mirrors.

    * RFE for “on-demand” resilvering to explicitly re-balance
Variable Block Size
•   No single value works well with all types of files
    •   Large blocks increase bandwidth but reduce metadata and can lead to
        wasted space

    •   Small blocks save space for smaller files, but increase I/O operations on
        larger ones

    •   Record-based files such as those used by databases have a fixed block
        size that must be matched by the filesystem to avoid extra overhead
        (blocks too small) or read-modify-write (blocks too large)
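
For such record-structured workloads the dataset’s recordsize property can be matched to the application; a sketch, assuming a database with 8K pages in a hypothetical tank/db dataset:

    # zfs set recordsize=8k tank/db
    # zfs get recordsize tank/db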
Variable Block Size
•   The DMU operates on units of a fixed record size;
    default is 128KB

•   Files smaller than the record size are written as a
    single filesystem block (FSB) of variable size, in
    multiples of the 512-byte disk sector

•   Files larger than the record size are stored in
    multiple FSBs, each equal to the record size

•   DMU records are assembled into transaction groups
    and committed atomically
Variable Block Size

•   FSBs are the basic unit of ZFS datasets; checksums
    are maintained per FSB

•   Handled by the SPA, which can optionally transform
    them (compression, ditto blocks today; encryption,
    de-dupe in the future)

•   Compression improves I/O performance, as fewer
    operations are needed on the underlying disk
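
The savings are visible per dataset; for example:

    # zfs get compressratio tank/myfs    (achieved compression ratio)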
Intelligent Resilver
•   a.k.a. rebuild, resync, reconstruct

•   Traditional resilvering is basically a whole-disk copy
    in the mirror case; RAID-5 does XOR of the other
    disks to rebuild

    •   No priority given to more important blocks
        (top of the tree)

    •   If you’ve copied 99% of the blocks, but the last
        1% contains the top few blocks in the tree,
        another failure ruins everything
Intelligent Resilver
•   The ZFS way is metadata-driven

•   Live blocks only: just walk the block tree;
    unallocated blocks are ignored

•   Top-down: Start with the most important blocks.
    Every block copied increases the amount of
    discoverable data.

•   Transactional pruning: If the failure is transient,
    repair by identifying the missed TXGs. Resilver
    time is only slightly longer than the outage time.
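
A resilver is kicked off by replacing or reattaching a device and can be watched as it runs; a sketch with illustrative device names:

    # zpool replace tank c1t0d0 c5t0d0    (resilver onto the new disk)
    # zpool status tank                   (reports resilver progress)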
Keep It Simple
•   Unified management model: pools and datasets

•   Datasets are just a group of tagged bits with
    certain attributes: filesystems, volumes, snapshots,
    clones

•   Properties can be set while the dataset is active

•   Hierarchical arrangement: children inherit
    properties of parent

•   Datasets become administration points: give
    every user or application their own filesystem
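
A sketch of that hierarchy, using a hypothetical tank/home tree:

    # zfs create tank/home
    # zfs set compression=on tank/home     (children inherit this)
    # zfs create tank/home/alice           (inherits compression=on)
    # zfs set quota=10g tank/home/alice    (per-user administration point)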
Keep It Simple

•   Datasets only occupy as much space as they need

•   Compression, quotas and reservations are built-in
    properties

•   Pools may be grown dynamically without service
    interruption
Data Integrity
• Not enough to be fast and simple; must be
  safe too
• Silent corruption is our mortal enemy
  •   Defects can occur anywhere: disks, firmware, cables, kernel drivers

  •   Main memory has ECC; why shouldn’t storage have something similar?


• Other types of corruption are also killers:
  •   Power outages, accidental overwrites, using a disk as swap
Data Integrity
                     Traditional Method:
                    Disk Block Checksum

        (diagram: a checksum stored alongside each data block on disk)

Only detects problems after data is successfully written (“bit rot”)

  Won’t catch silent corruption caused by issues in the I/O path
                      between disk and host
Data Integrity
                        The ZFS Way

        (diagram: a tree of block pointers, each holding the
         checksum of the child block it points to)

•   Store data checksum in parent block pointer

•   Isolates faults between checksum and data

•   Forms a hash tree, enabling validation of the entire pool

•   256-bit checksums

•   fletcher2 (default, simple and fast) or SHA-256 (slower, more secure)

•   Can be validated at any time with ‘zpool scrub’
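
Validation on demand, for example:

    # zpool scrub tank
    # zpool status -v tank    (scrub progress, plus any errors found)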
Data Integrity

        (slide animation, figures not shown: an application requests a
         block from a two-way mirror; one copy fails its checksum, so
         ZFS returns the good copy from the other side and rewrites the
         damaged one)

  Self-healing mirror!
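
Detected (and repaired) checksum failures show up as counters; for example:

    # zpool status -x tank    (health summary; CKSUM column counts checksum errors)
    # zpool clear tank        (reset the error counters after review)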
Goodie Bag

• Block Transforms
• Snapshots & Clones
• Sharing (NFS, CIFS, iSCSI)
• Platform-neutral on-disk format
Block Transforms
•   Handled at SPA layer, transparent to upper layers
•   Available today:
    • Compression
        •   zfs set compression=on tank/myfs
        •   LZJB (default) or GZIP
        •   Multi-threaded as of snv_79

    •   Duplication, a.k.a. “ditto blocks”
        •   zfs set copies=N tank/myfs
        •   In addition to mirroring/RAID-Z: One logical block = up to 3
            physical blocks
        •   Metadata always has 2+ copies, even without ditto blocks
        •   Copies stored on different devices, or different places on same
            device

•   Future: de-duplication, encryption
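
Combining transforms on one dataset, as a sketch:

    # zfs set compression=gzip tank/myfs    (GZIP instead of the default LZJB)
    # zfs set copies=2 tank/myfs            (two physical copies per logical block)
    # zfs get compression,copies tank/myfs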
Snapshots & Clones
•   zfs snapshot tank/myfs@thursday

•   Based on block birth time, stored in block pointer

•   Nearly instantaneous (<1 sec) on idle system

•   Communicates structure, since it is based on
    object changes, not just a block delta

•   Occupies negligible space initially, and only grows
    as large as the block changeset

•   Clone is just a read/write snapshot
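
The snapshot family of commands, sketched on the same example dataset (host and backup pool names are hypothetical):

    # zfs snapshot tank/myfs@thursday
    # zfs clone tank/myfs@thursday tank/myfs-test    (writable clone)
    # zfs rollback tank/myfs@thursday                (revert the live filesystem)
    # zfs send tank/myfs@thursday | ssh host2 zfs recv backup/myfs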
Sharing
•   NFSv4
    •   zfs set sharenfs=on tank/myfs
    •   Automatically updates /etc/dfs/sharetab


•   CIFS
    •   zfs set sharesmb=on tank/myfs
    •   Additional properties control the share name and workgroup
    •   Supports full NT ACLs and user mapping, not just POSIX uid


•   iSCSI
    •   zfs set shareiscsi=on tank/myvol
    •   Makes sharing block devices as easy as sharing filesystems
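
Share state can be confirmed from the dataset properties and the system share table; for example:

    # zfs get sharenfs,sharesmb tank/myfs
    # cat /etc/dfs/sharetab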
On-Disk Format
• Platform-neutral, adaptive endianness
  •   Writes always use native endianness, recorded in a bit in the block
      pointer

  •   Reads byteswap if necessary, based on comparison of host endianness to
      value of block pointer bit


• Migrate between x86 and SPARC
  •   No worries about device paths, fstab, mountpoints, it all just works

  •   ‘zpool export’ on old host, move disks, ‘zpool import’ on new host

  •   Also migrate between Solaris and non-Sun implementations, such as
      MacOS X and FreeBSD
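
The whole migration, sketched:

    old# zpool export tank    (quiesce and release the pool)
        ...move the disks to the new host...
    new# zpool import tank    (device paths rediscovered automatically)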
Fin
Further reading:
ZFS Community:

http://opensolaris.org/os/community/zfs

ZFS Administration Guide:

http://docs.sun.com/app/docs/doc/819-5461

Jeff Bonwick’s blog:

http://blogs.sun.com/bonwick/en_US/category/ZFS

ZFS-related blog entries:

http://blogs.sun.com/main/tags/zfs
