SlideShare une entreprise Scribd logo
1  sur  20
Télécharger pour lire hors ligne
The Btrfs
Filesystem

Chris Mason
The Btrfs Filesystem

 •   Jointly developed by a number of companies
         Oracle, Redhat, Fujitsu, Intel, SUSE, many others
 •   All data and metadata is written via copy-on-write
 •   CRCs maintained for all metadata and data
 •   Efficient writable snapshots
 •   Multi-device support
 •   Online resize and defrag
 •   Transparent compression
 •   Efficient storage for small files
 •   SSD optimizations and trim support
Btrfs Progress


 •   Extensive performance and stability fixes
 •   New and improving repair tool
 •   Background scrubbing
 •   Automatic repair of corrupt blocks
 •   RAID restriping
 •   Configurable metadata block sizes
 •   Improved IO error handling infrastructure
The Btrfs Structures


 •   Btrfs only stores one kind of metadata block
 •   Btree blocks store key/value pairs
 •   Metadata structures use specific keys to store related items
     close together on disk
 •   Logical address layer translates data and metadata blocks to
     physical areas of the storage
 •   Metadata for different files and directories can all be stored in
     the same btree block
Storage Allocation

         Extent Allocation Tree
 Block          Block          Block       •   Storage allocated in chunks
 Group 1        Group 2        Group 3
                                               to create specific raid levels
Chunk Tree: Logical -> Physical Map
                                           •   Data and metadata can
 Chunk 1        Chunk 2        Chunk 3         have different raid levels
 RAID0          RAID1          RAID0

Disk 1     Disk 2     Disk 3      Disk 4



                       Free       Free
New Storage Technologies – Flash



 •   Flash lifetime is limited by the number of write cycles to a cell
 •   Small writes require internal read/modify/write cycles in the
     device
         A 1MB write might count as small
         Each small write may be amplified into multiple larger writes
 •   Flash lifetime can be increased if the FS works together with
     the device
Hints For the Storage

 •   Discard and Trim allow the device to ignore blocks the FS
     isn’t using
 •   Devices may be tiered internally
         Frequently modified or deleted blocks stay on faster cells
         Long lived blocks moved to less expensive storage
 •   New APIs and standards will allow the FS to give hints to the
     device
 •   Large arrays and high end flash can use the hints to improve
     performance
 •   Low end flash can use hints to increase cell lifetime
 •   Btrfs block group layout separates shorter lived metadata
     from data
Discard/Trim



 •   Trim and discard notify storage when we are done with a block
 •   Btrfs supports both real-time trim and batched trim
 •   Real-time trims blocks as they are freed
 •   Batched trims all free space via an ioctl
 •   Newer kernels will have less penalty for online discard
Btrfs Restriper



 •   Newly introduced in 3.3
 •   Advanced control over block groups and storage
 •   Balance data across drives to select new RAID levels
 •   Balance filtering by usage for thin provisioning
 •   Ex: btrfs filesystem balance start -dconvert=raid1
When Bad Things Happen to Good Data


 •   Barrier bugs in Btrfs lead to most of the corruptions seen with
     kernels before v3.2
 •   Filesystem repair tool in btrfs-progs git
         Repairs extent allocation tree corruptions in place
         More repair modes in progress
 •   Filesystem recovery tool from Josef Bacik
         Risk free – copies data out of the corrupt FS
 •   Tree root history log to recover from many hardware errors
         Jumps back to older versions of the tree roots
Larger Metadata Blocks


 •   Btrfs btree uses key ordering to group related items into the
     same metadata block
 •   COW tends to fragment the btree over time
 •   Larger blocksizes provide very inexpensive btree
     defragmentation
 •   Larger blocksizes reduce extent allocation overhead (fewer
     extents)
 •   Larger blocksizes allow metadata on raid5/6
Metadata Blocksizes and Writeback

 •   Btrfs COW tends to allocate and free pages often
 •   The Linux VM was keeping our stale pages on the LRU too
     long
 •   In metadata heavy workloads Btrfs did many more reads than
     it should have
 •   After fixing metadata caching and enabling larger blocks:
         Create 32 million empty files
         Btrfs – 170K files/sec (16KB metadata)
         Btrfs – 150K files/sec (4KB metadata after LRU fixes )
         XFS – 115K files/sec
         Ext4 – 110K files/sec (256MB log)
IO Animations



 •   Ext4 is bottlenecked on reading the inode tables
 •   Both XFS and Btrfs are CPU bound
 •   XFS is walking forward through a series of distinct disk areas
 •   Both XFS and Ext4 show heavy log activity
 •   Btrfs is doing sequential writes and some random reads
Scrub


 •   Btrfs CRCs allow us to verify data stored on disk
 •   CRC errors can be corrected by reading a good copy of the
     block from another drive
 •   Scrubbing code scans the allocated data and metadata blocks
 •   Any CRC errors are fixed during the scan if a second copy
     exists
 •   Will be extended to track and offline bad devices
Seed Devices



 •   A readonly device can be used as a filesystem seed
 •   Read/write devices can be added to store modifications
 •   Changes to the writable devices are persistent across reboots
 •   The readonly device can be removed at any time
 •   Multiple read/write filesystems can be built from the same
     seed
Embedded Systems




 •   Btrfs is fairly friendly to small machines
 •   Very little memory is pinned by the filesystem
 •   Btrfs works very well overall on low end flash
Thank You!


 •   Chris Mason <chris.mason@oracle.com>
 •   http://btrfs.wiki.kernel.org
 •   http://oracle.com/linux
 •   Free OTN SysAmdin Training
          Tuesday, April 10 8am-4pm
          Oracle Linux configuration and management (Btrfs included)
          Oracle Santa Clara Offices, CA

Contenu connexe

Tendances

LUG-BG 2017 - Rangel Ivanov - Spread some butter - BTRFS
LUG-BG 2017 - Rangel Ivanov - Spread some butter - BTRFSLUG-BG 2017 - Rangel Ivanov - Spread some butter - BTRFS
LUG-BG 2017 - Rangel Ivanov - Spread some butter - BTRFSMarian Marinov
 
PostgreSQL on EXT4, XFS, BTRFS and ZFS
PostgreSQL on EXT4, XFS, BTRFS and ZFSPostgreSQL on EXT4, XFS, BTRFS and ZFS
PostgreSQL on EXT4, XFS, BTRFS and ZFSTomas Vondra
 
PostgreSQL on ZFS Lightning Talk
PostgreSQL on ZFS Lightning TalkPostgreSQL on ZFS Lightning Talk
PostgreSQL on ZFS Lightning TalkSean Chittenden
 
Feature rich BTRFS is Getting Richer with Encryption
Feature rich BTRFS is Getting Richer with EncryptionFeature rich BTRFS is Getting Richer with Encryption
Feature rich BTRFS is Getting Richer with EncryptionLF Events
 
PostgreSQL + ZFS best practices
PostgreSQL + ZFS best practicesPostgreSQL + ZFS best practices
PostgreSQL + ZFS best practicesSean Chittenden
 
NFS updates for CLSF
NFS updates for CLSFNFS updates for CLSF
NFS updates for CLSFbergwolf
 
Filesystem Showdown: What a Difference a Decade Makes
Filesystem Showdown: What a Difference a Decade MakesFilesystem Showdown: What a Difference a Decade Makes
Filesystem Showdown: What a Difference a Decade MakesPerforce
 
RAID, Replication, and You
RAID, Replication, and YouRAID, Replication, and You
RAID, Replication, and YouGreat Wide Open
 
S8 File Systems Tutorial USENIX LISA13
S8 File Systems Tutorial USENIX LISA13S8 File Systems Tutorial USENIX LISA13
S8 File Systems Tutorial USENIX LISA13Richard Elling
 
11 linux filesystem copy
11 linux filesystem copy11 linux filesystem copy
11 linux filesystem copyShay Cohen
 
Lavigne bsdmag-jan2012
Lavigne bsdmag-jan2012Lavigne bsdmag-jan2012
Lavigne bsdmag-jan2012Dru Lavigne
 
StartUpOpen 2011 - Projekat13
StartUpOpen 2011 - Projekat13StartUpOpen 2011 - Projekat13
StartUpOpen 2011 - Projekat13BlogOpen
 

Tendances (18)

LUG-BG 2017 - Rangel Ivanov - Spread some butter - BTRFS
LUG-BG 2017 - Rangel Ivanov - Spread some butter - BTRFSLUG-BG 2017 - Rangel Ivanov - Spread some butter - BTRFS
LUG-BG 2017 - Rangel Ivanov - Spread some butter - BTRFS
 
PostgreSQL on EXT4, XFS, BTRFS and ZFS
PostgreSQL on EXT4, XFS, BTRFS and ZFSPostgreSQL on EXT4, XFS, BTRFS and ZFS
PostgreSQL on EXT4, XFS, BTRFS and ZFS
 
PostgreSQL on ZFS Lightning Talk
PostgreSQL on ZFS Lightning TalkPostgreSQL on ZFS Lightning Talk
PostgreSQL on ZFS Lightning Talk
 
MySQL on ZFS
MySQL on ZFSMySQL on ZFS
MySQL on ZFS
 
Feature rich BTRFS is Getting Richer with Encryption
Feature rich BTRFS is Getting Richer with EncryptionFeature rich BTRFS is Getting Richer with Encryption
Feature rich BTRFS is Getting Richer with Encryption
 
BiTteRsweet FS
BiTteRsweet FSBiTteRsweet FS
BiTteRsweet FS
 
PostgreSQL + ZFS best practices
PostgreSQL + ZFS best practicesPostgreSQL + ZFS best practices
PostgreSQL + ZFS best practices
 
NFS updates for CLSF
NFS updates for CLSFNFS updates for CLSF
NFS updates for CLSF
 
Filesystem Showdown: What a Difference a Decade Makes
Filesystem Showdown: What a Difference a Decade MakesFilesystem Showdown: What a Difference a Decade Makes
Filesystem Showdown: What a Difference a Decade Makes
 
RAID, Replication, and You
RAID, Replication, and YouRAID, Replication, and You
RAID, Replication, and You
 
S8 File Systems Tutorial USENIX LISA13
S8 File Systems Tutorial USENIX LISA13S8 File Systems Tutorial USENIX LISA13
S8 File Systems Tutorial USENIX LISA13
 
11 linux filesystem copy
11 linux filesystem copy11 linux filesystem copy
11 linux filesystem copy
 
ZFS Talk Part 1
ZFS Talk Part 1ZFS Talk Part 1
ZFS Talk Part 1
 
Flourish16
Flourish16Flourish16
Flourish16
 
Fossetcon14
Fossetcon14Fossetcon14
Fossetcon14
 
Lavigne bsdmag-jan2012
Lavigne bsdmag-jan2012Lavigne bsdmag-jan2012
Lavigne bsdmag-jan2012
 
Fsoss2011
Fsoss2011Fsoss2011
Fsoss2011
 
StartUpOpen 2011 - Projekat13
StartUpOpen 2011 - Projekat13StartUpOpen 2011 - Projekat13
StartUpOpen 2011 - Projekat13
 

En vedette

Cities social issues
Cities social issuesCities social issues
Cities social issuesdwessler
 
I can\'t believe this is butter - A Tour of btrfs
I can\'t believe this is butter - A Tour of btrfsI can\'t believe this is butter - A Tour of btrfs
I can\'t believe this is butter - A Tour of btrfsAvi Miller
 
Sheepdog- Google Webinar
Sheepdog- Google Webinar Sheepdog- Google Webinar
Sheepdog- Google Webinar Sheepdog
 
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...BertrandDrouvot
 
Sheepdog: yet another all in-one storage for openstack
Sheepdog: yet another all in-one storage for openstackSheepdog: yet another all in-one storage for openstack
Sheepdog: yet another all in-one storage for openstackLiu Yuan
 
File System Comparison on Linux Ubuntu
File System Comparison on Linux UbuntuFile System Comparison on Linux Ubuntu
File System Comparison on Linux UbuntuJayesh Tambe
 
Container Storage Best Practices in 2017
Container Storage Best Practices in 2017Container Storage Best Practices in 2017
Container Storage Best Practices in 2017Keith Resar
 

En vedette (9)

Cities social issues
Cities social issuesCities social issues
Cities social issues
 
I can\'t believe this is butter - A Tour of btrfs
I can\'t believe this is butter - A Tour of btrfsI can\'t believe this is butter - A Tour of btrfs
I can\'t believe this is butter - A Tour of btrfs
 
Sheepdog- Google Webinar
Sheepdog- Google Webinar Sheepdog- Google Webinar
Sheepdog- Google Webinar
 
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
 
Sheepdog: yet another all in-one storage for openstack
Sheepdog: yet another all in-one storage for openstackSheepdog: yet another all in-one storage for openstack
Sheepdog: yet another all in-one storage for openstack
 
File System Comparison on Linux Ubuntu
File System Comparison on Linux UbuntuFile System Comparison on Linux Ubuntu
File System Comparison on Linux Ubuntu
 
Container Storage Best Practices in 2017
Container Storage Best Practices in 2017Container Storage Best Practices in 2017
Container Storage Best Practices in 2017
 
ZFS in 30 minutes
ZFS in 30 minutesZFS in 30 minutes
ZFS in 30 minutes
 
ZFS et BTRFS
ZFS et BTRFSZFS et BTRFS
ZFS et BTRFS
 

Similaire à Btrfs by Chris Mason

Course 102: Lecture 27: FileSystems in Linux (Part 2)
Course 102: Lecture 27: FileSystems in Linux (Part 2)Course 102: Lecture 27: FileSystems in Linux (Part 2)
Course 102: Lecture 27: FileSystems in Linux (Part 2)Ahmed El-Arabawy
 
Working of Volatile and Non-Volatile memory
Working of Volatile and Non-Volatile memoryWorking of Volatile and Non-Volatile memory
Working of Volatile and Non-Volatile memoryDon Caeiro
 
19IS305_U4_LP10_LM10-22-23.pdf
19IS305_U4_LP10_LM10-22-23.pdf19IS305_U4_LP10_LM10-22-23.pdf
19IS305_U4_LP10_LM10-22-23.pdfJESUNPK
 
16119 - Get to Know Your Data Sets (1).pdf
16119 - Get to Know Your Data Sets (1).pdf16119 - Get to Know Your Data Sets (1).pdf
16119 - Get to Know Your Data Sets (1).pdf3operatordcslipiPeng
 
storage techniques_overview-1.pptx
storage techniques_overview-1.pptxstorage techniques_overview-1.pptx
storage techniques_overview-1.pptx20CS102RAMMPRASHATHK
 
File Systems: Why, How and Where
File Systems: Why, How and WhereFile Systems: Why, How and Where
File Systems: Why, How and WhereKernel TLV
 
Windows Forensics- Introduction and Analysis
Windows Forensics- Introduction and AnalysisWindows Forensics- Introduction and Analysis
Windows Forensics- Introduction and AnalysisDon Caeiro
 
AOS Lab 9: File system -- Of buffers, logs, and blocks
AOS Lab 9: File system -- Of buffers, logs, and blocksAOS Lab 9: File system -- Of buffers, logs, and blocks
AOS Lab 9: File system -- Of buffers, logs, and blocksZubair Nabi
 
File system and Deadlocks
File system and DeadlocksFile system and Deadlocks
File system and DeadlocksRohit Jain
 
04.01 file organization
04.01 file organization04.01 file organization
04.01 file organizationBishal Ghimire
 

Similaire à Btrfs by Chris Mason (20)

Os
OsOs
Os
 
Ext filesystem4
Ext filesystem4Ext filesystem4
Ext filesystem4
 
Course 102: Lecture 27: FileSystems in Linux (Part 2)
Course 102: Lecture 27: FileSystems in Linux (Part 2)Course 102: Lecture 27: FileSystems in Linux (Part 2)
Course 102: Lecture 27: FileSystems in Linux (Part 2)
 
Working of Volatile and Non-Volatile memory
Working of Volatile and Non-Volatile memoryWorking of Volatile and Non-Volatile memory
Working of Volatile and Non-Volatile memory
 
19IS305_U4_LP10_LM10-22-23.pdf
19IS305_U4_LP10_LM10-22-23.pdf19IS305_U4_LP10_LM10-22-23.pdf
19IS305_U4_LP10_LM10-22-23.pdf
 
16119 - Get to Know Your Data Sets (1).pdf
16119 - Get to Know Your Data Sets (1).pdf16119 - Get to Know Your Data Sets (1).pdf
16119 - Get to Know Your Data Sets (1).pdf
 
Windows File Systems
Windows File SystemsWindows File Systems
Windows File Systems
 
Thiru
ThiruThiru
Thiru
 
Lect09
Lect09Lect09
Lect09
 
kerch04.ppt
kerch04.pptkerch04.ppt
kerch04.ppt
 
Storage Managment
Storage ManagmentStorage Managment
Storage Managment
 
Windows File Systems
Windows File SystemsWindows File Systems
Windows File Systems
 
storage techniques_overview-1.pptx
storage techniques_overview-1.pptxstorage techniques_overview-1.pptx
storage techniques_overview-1.pptx
 
File Systems: Why, How and Where
File Systems: Why, How and WhereFile Systems: Why, How and Where
File Systems: Why, How and Where
 
Windows Forensics- Introduction and Analysis
Windows Forensics- Introduction and AnalysisWindows Forensics- Introduction and Analysis
Windows Forensics- Introduction and Analysis
 
AOS Lab 9: File system -- Of buffers, logs, and blocks
AOS Lab 9: File system -- Of buffers, logs, and blocksAOS Lab 9: File system -- Of buffers, logs, and blocks
AOS Lab 9: File system -- Of buffers, logs, and blocks
 
Computer details
Computer detailsComputer details
Computer details
 
Disk
DiskDisk
Disk
 
File system and Deadlocks
File system and DeadlocksFile system and Deadlocks
File system and Deadlocks
 
04.01 file organization
04.01 file organization04.01 file organization
04.01 file organization
 

Plus de Terry Wang

Introduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud NativeIntroduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud NativeTerry Wang
 
Debugging and Configuration Best Practices for Oracle Linux
Debugging and Configuration Best Practices for Oracle LinuxDebugging and Configuration Best Practices for Oracle Linux
Debugging and Configuration Best Practices for Oracle LinuxTerry Wang
 
Oracle Linux Nov 2011 Webcast
Oracle Linux Nov 2011 WebcastOracle Linux Nov 2011 Webcast
Oracle Linux Nov 2011 WebcastTerry Wang
 
Oracle Buys Ksplice
Oracle Buys KspliceOracle Buys Ksplice
Oracle Buys KspliceTerry Wang
 
Eliminating Silent Data Corruption with Oracle Linux
Eliminating Silent Data Corruption with Oracle Linux Eliminating Silent Data Corruption with Oracle Linux
Eliminating Silent Data Corruption with Oracle Linux Terry Wang
 
我的 Ubuntu 之旅
我的 Ubuntu 之旅我的 Ubuntu 之旅
我的 Ubuntu 之旅Terry Wang
 
Get the Facts: Oracle's Unbreakable Enterprise Kernel
Get the Facts: Oracle's Unbreakable Enterprise KernelGet the Facts: Oracle's Unbreakable Enterprise Kernel
Get the Facts: Oracle's Unbreakable Enterprise KernelTerry Wang
 
Git 101 tutorial presentation
Git 101 tutorial presentationGit 101 tutorial presentation
Git 101 tutorial presentationTerry Wang
 
Git 入门 与 实践
Git 入门 与 实践Git 入门 与 实践
Git 入门 与 实践Terry Wang
 
Git 入门与实践
Git 入门与实践Git 入门与实践
Git 入门与实践Terry Wang
 
WCI 10gR3 overview
WCI 10gR3 overviewWCI 10gR3 overview
WCI 10gR3 overviewTerry Wang
 

Plus de Terry Wang (14)

Introduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud NativeIntroduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud Native
 
Debugging and Configuration Best Practices for Oracle Linux
Debugging and Configuration Best Practices for Oracle LinuxDebugging and Configuration Best Practices for Oracle Linux
Debugging and Configuration Best Practices for Oracle Linux
 
Oracle Linux Nov 2011 Webcast
Oracle Linux Nov 2011 WebcastOracle Linux Nov 2011 Webcast
Oracle Linux Nov 2011 Webcast
 
RHEL roadmap
RHEL roadmapRHEL roadmap
RHEL roadmap
 
Oracle Buys Ksplice
Oracle Buys KspliceOracle Buys Ksplice
Oracle Buys Ksplice
 
Eliminating Silent Data Corruption with Oracle Linux
Eliminating Silent Data Corruption with Oracle Linux Eliminating Silent Data Corruption with Oracle Linux
Eliminating Silent Data Corruption with Oracle Linux
 
我的 Ubuntu 之旅
我的 Ubuntu 之旅我的 Ubuntu 之旅
我的 Ubuntu 之旅
 
Get the Facts: Oracle's Unbreakable Enterprise Kernel
Get the Facts: Oracle's Unbreakable Enterprise KernelGet the Facts: Oracle's Unbreakable Enterprise Kernel
Get the Facts: Oracle's Unbreakable Enterprise Kernel
 
Git 101 tutorial presentation
Git 101 tutorial presentationGit 101 tutorial presentation
Git 101 tutorial presentation
 
Git
GitGit
Git
 
Git 入门 与 实践
Git 入门 与 实践Git 入门 与 实践
Git 入门 与 实践
 
Git 入门与实践
Git 入门与实践Git 入门与实践
Git 入门与实践
 
WCI 10gR3 overview
WCI 10gR3 overviewWCI 10gR3 overview
WCI 10gR3 overview
 
ALUI 6.5
ALUI 6.5ALUI 6.5
ALUI 6.5
 

Btrfs by Chris Mason

  • 2. The Btrfs Filesystem • Jointly developed by a number of companies Oracle, Redhat, Fujitsu, Intel, SUSE, many others • All data and metadata is written via copy-on-write • CRCs maintained for all metadata and data • Efficient writable snapshots • Multi-device support • Online resize and defrag • Transparent compression • Efficient storage for small files • SSD optimizations and trim support
  • 3. Btrfs Progress • Extensive performance and stability fixes • New and improving repair tool • Background scrubbing • Automatic repair of corrupt blocks • RAID restriping • Configurable metadata block sizes • Improved IO error handling infrastructure
  • 4. The Btrfs Structures • Btrfs only stores one kind of metadata block • Btree blocks store key/value pairs • Metadata structures use specific keys to store related items close together on disk • Logical address layer translates data and metadata blocks to physical areas of the storage • Metadata for different files and directories can all be stored in the same btree block
  • 5. Storage Allocation Extent Allocation Tree Block Block Block • Storage allocated in chunks Group 1 Group 2 Group 3 to create specific raid levels Chunk Tree: Logical -> Physical Map • Data and metadata can Chunk 1 Chunk 2 Chunk 3 have different raid levels RAID0 RAID1 RAID0 Disk 1 Disk 2 Disk 3 Disk 4 Free Free
  • 6. New Storage Technologies – Flash • Flash lifetime is limited by the number of write cycles to a cell • Small writes require internal read/modify/write cycles in the device A 1MB write might count as small Each small write may be amplified into multiple larger writes • Flash lifetime can be increased if the FS works together with the device
  • 7. Hints For the Storage • Discard and Trim allow the device to ignore blocks the FS isn’t using • Devices may be tiered internally Frequently modified or deleted blocks stay on faster cells Long lived blocks moved to less expensive storage • New APIs and standards will allow the FS to give hints to the device • Large arrays and high end flash can use the hints to improve performance • Low end flash can use hints to increase cell lifetime • Btrfs block group layout separates shorter lived metadata from data
  • 8. Discard/Trim • Trim and discard notify storage when we are done with a block • Btrfs supports both real-time trim and batched trim • Real-time trims blocks as they are freed • Batched trims all free space via an ioctl • Newer kernels will have less penalty for online discard
  • 9. Btrfs Restriper • Newly introduced in 3.3 • Advanced control over block groups and storage • Balance data across drives to select new RAID levels • Balance filtering by usage for thin provisioning • Ex: btrfs filesystem balance start -dconvert=raid1
  • 10. When Bad Things Happen to Good Data • Barrier bugs in Btrfs lead to most of the corruptions seen with kernels before v3.2 • Filesystem repair tool in btrfs-progs git Repairs extent allocation tree corruptions in place More repair modes in progress • Filesystem recovery tool from Josef Bacik Risk free – copies data out of the corrupt FS • Tree root history log to recover from many hardware errors Jumps back to older versions of the tree roots
  • 11. Larger Metadata Blocks • Btrfs btree uses key ordering to group related items into the same metadata block • COW tends to fragment the btree over time • Larger blocksizes provide very inexpensive btree defragmentation • Larger blocksizes reduce extent allocation overhead (fewer extents) • Larger blocksizes allow metadata on raid5/6
  • 12. Metadata Blocksizes and Writeback • Btrfs COW tends to allocate and free pages often • The Linux VM was keeping our stale pages on the LRU too long • In metadata heavy workloads Btrfs did many more reads than it should have • After fixing metadata caching and enabling larger blocks: Create 32 million empty files Btrfs – 170K files/sec (16KB metadata) Btrfs – 150K files/sec (4KB metadata after LRU fixes ) XFS – 115K files/sec Ext4 – 110K files/sec (256MB log)
  • 13.
  • 14.
  • 15.
  • 16. IO Animations • Ext4 is bottlenecked on reading the inode tables • Both XFS and Btrfs are CPU bound • XFS is walking forward through a series of distinct disk areas • Both XFS and Ext4 show heavy log activity • Btrfs is doing sequential writes and some random reads
  • 17. Scrub • Btrfs CRCs allow us to verify data stored on disk • CRC errors can be corrected by reading a good copy of the block from another drive • Scrubbing code scans the allocated data and metadata blocks • Any CRC errors are fixed during the scan if a second copy exists • Will be extended to track and offline bad devices
  • 18. Seed Devices • A readonly device can be used as a filesystem seed • Read/write devices can be added to store modifications • Changes to the writable devices are persistent across reboots • The readonly device can be removed at any time • Multiple read/write filesystems can be built from the same seed
  • 19. Embedded Systems • Btrfs is fairly friendly to small machines • Very little memory is pinned by the filesystem • Btrfs works very well overall on low end flash
  • 20. Thank You! • Chris Mason <chris.mason@oracle.com> • http://btrfs.wiki.kernel.org • http://oracle.com/linux • Free OTN SysAmdin Training Tuesday, April 10 8am-4pm Oracle Linux configuration and management (Btrfs included) Oracle Santa Clara Offices, CA