1. I can’t believe this is butter! A tour of btrfs
Avi Miller
Principal Program Manager, Oracle
Copyright © 2011, Oracle and/or its affiliates. All rights reserved.
2. The Btrfs Filesystem
• Jointly developed by a number of companies
– Oracle, Red Hat, Fujitsu, Intel, SUSE and many others
• All data and metadata are written via copy-on-write
• CRCs maintained for all metadata and data
• Efficient writable snapshots
• Multi-device support
• Online resize and defrag
• Transparent compression
• Efficient storage for small files
• SSD optimisations and TRIM support
3. Btrfs Progress
• Extensive performance and stability fixes
• Significant code cleanups
• Efficient free space caching across reboots
• Delayed metadata insertion and deletion
• Background scrubbing
• New LZO compression mode
• New Snappy compression mode in development
• Batched discard (via ioctl)
• Per-inode flags to control COW, compression
• Automatic file defrag option
4. Logging improvements
• Btrfs fsync log was rewriting some items over and over
• New code from Fujitsu bumps the metadata generation
numbers inside a transaction
• Cuts down log traffic by 75%
• Will go into 3.2 merge window
5. Metadata Fragmentation
• Btrfs btree uses key ordering to group related items into
the same metadata block
• COW tends to fragment the btree over time
• Larger block sizes lower metadata overhead and
improve performance
• Larger block sizes provide inexpensive btree
defragmentation
• E.g.: Intel 120GB MLC drive:
– 4KB random reads: 78MB/s
– 8KB random reads: 137MB/s
– 16KB random reads: 186MB/s
• Code queued up for Linux 3.3 allows larger block sizes
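The throughput figures above can be turned into read operations per second to see what the larger blocks are buying. This is only a sketch of the arithmetic: the slide does not state its units, so the conversion assumes 1 MB = 10**6 bytes and 1 KB = 1024 bytes, and the function names are illustrative.

```python
# Throughput figures quoted on the slide: block size in KB -> MB/s.
RESULTS_MB_PER_S = {4: 78, 8: 137, 16: 186}


def reads_per_second(block_kb, mb_per_s):
    """Convert throughput to reads/s (assumes 1 MB = 10**6 B, 1 KB = 1024 B)."""
    return mb_per_s * 10**6 / (block_kb * 1024)


def summarize(results):
    """Return (block_kb, mb_per_s, reads_per_s, speedup_vs_4KB) rows."""
    base = results[4]
    rows = []
    for block_kb, mb_per_s in sorted(results.items()):
        rows.append((
            block_kb,
            mb_per_s,
            round(reads_per_second(block_kb, mb_per_s)),
            round(mb_per_s / base, 2),
        ))
    return rows


for row in summarize(RESULTS_MB_PER_S):
    print(row)
```

Under those unit assumptions the drive completes fewer reads per second as the block size grows, but each read returns more data, so 16KB blocks deliver roughly 2.4 times the throughput of 4KB blocks.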
6. Scrubbing
• Btrfs CRCs allow us to verify data stored on disk
• CRC errors can be corrected by reading a good copy of
the block from another drive
• New scrubbing code scans the allocated data and
metadata blocks (Arne Jansen)
• Any CRC errors are fixed during the scan if a second
copy exists
• Will be extended to track and offline bad devices
• First Demo: btrfs filesystem creation and scrubbing
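The scrub behaviour described above can be sketched as a toy model: walk every block, compare each copy against its stored checksum, and overwrite bad copies from a good one when a second copy exists. This is not the kernel implementation. The function names and data layout are hypothetical, and `zlib.crc32` stands in for the CRC32C checksum that btrfs actually uses.

```python
import zlib


def checksum(block):
    # Stand-in checksum: btrfs uses CRC32C, which is a different polynomial.
    return zlib.crc32(block)


def scrub(mirrors, checksums):
    """Verify every block on every mirror and repair bad copies.

    mirrors:   list of devices, each a list of bytes objects (one per block)
    checksums: expected checksum for each block index
    Returns (repaired, unrecoverable): repaired is a list of
    (device_index, block_index) pairs, unrecoverable a list of block indexes.
    """
    repaired, unrecoverable = [], []
    for idx, expected in enumerate(checksums):
        # Look for any copy of this block whose checksum still matches.
        good = next(
            (dev[idx] for dev in mirrors if checksum(dev[idx]) == expected),
            None,
        )
        if good is None:
            unrecoverable.append(idx)  # no valid copy exists anywhere
            continue
        for dev_no, dev in enumerate(mirrors):
            if checksum(dev[idx]) != expected:
                dev[idx] = good  # rewrite the bad copy from the good one
                repaired.append((dev_no, idx))
    return repaired, unrecoverable
```

With two mirrored devices where one copy of a block has been corrupted, `scrub` rewrites that copy and reports it as repaired. A block that is corrupt on every device is only reported, since there is nothing to repair it from.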
7. Discard/Trim
• Trim and discard notify storage that we’re done with a
block
• Btrfs now supports both real-time and batched trim
• Real-time trim discards blocks as they are freed
• Batched trim discards all free space via an ioctl
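The difference between the two modes can be illustrated with a toy allocator. This is a simplified model, not btrfs code: `Device`, `Allocator`, and `trim_all` are hypothetical names, and `trim_all` merely stands in for what the ioctl would trigger.

```python
class Device:
    """Toy storage device that records each discard request it receives."""

    def __init__(self):
        self.discards = []  # one sorted list of block numbers per request

    def discard(self, blocks):
        self.discards.append(sorted(blocks))


class Allocator:
    """Toy block allocator with optional real-time discard."""

    def __init__(self, device, nblocks, realtime):
        self.device = device
        self.realtime = realtime
        self.free_blocks = set(range(nblocks))

    def allocate(self):
        block = min(self.free_blocks)
        self.free_blocks.remove(block)
        return block

    def free(self, block):
        self.free_blocks.add(block)
        if self.realtime:
            # Real-time mode: tell the device as soon as the block is freed.
            self.device.discard([block])

    def trim_all(self):
        # Batched mode: one request covering all currently free space.
        self.device.discard(self.free_blocks)
```

In this model, real-time mode issues one small discard per freed block, while batched mode issues nothing until `trim_all` is called and then covers every free block in a single request.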
8. Drive swapping
• Google Summer of Code (GSoC) project
• Current RAID rebuild works via rebalance code
• Moves all extents to new locations as it rebuilds
• Drive swapping replaces an existing drive in-place
• Uses extent-allocation map to limit bytes read
• Can also restripe between RAID levels
– Pull request sent this morning!
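The idea of using an extent-allocation map to limit the bytes read during an in-place drive swap can be sketched as follows. This is a minimal illustration with hypothetical names, assuming the map is a list of (offset, length) pairs that all lie within the device. It is not the actual btrfs code.

```python
def replace_device(old_dev, extent_map):
    """Copy only the allocated extents onto a blank device of the same size.

    old_dev:    bytes-like contents of the drive being replaced
    extent_map: iterable of (offset, length) pairs marking allocated ranges,
                assumed to lie entirely within the device
    Returns (new_dev, bytes_read).
    """
    new_dev = bytearray(len(old_dev))
    bytes_read = 0
    for offset, length in extent_map:
        # Data keeps its original offset, so nothing else has to be updated.
        new_dev[offset:offset + length] = old_dev[offset:offset + length]
        bytes_read += length
    return new_dev, bytes_read
```

Because unallocated ranges are never read, the work scales with the amount of data stored rather than with the size of the drive, and extents stay at the same offsets instead of being moved to new locations as in a rebalance.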
9. Efficient backups
• Advanced btrfs send/receive tool in development (Jan Schmidt)
• Transmits in neutral format so corruptions are not
duplicated
10. Embedded Systems
• Btrfs is fairly friendly to small machines
• Btrfs is not quite as friendly to small disks
– But this is getting better
• Btrfs works very well overall on low-end flash
11. RAID5/6
• Initial implementation from Intel some time ago
• Merge pending completion of fsck work
• Will also add triple mirroring
• Mixed RAID modes for metadata and data are included
12. When Bad Things Happen to Good Data
• Beta filesystem recovery tool from Josef Bacik
– Risk-free: copies data out of the corrupt FS
• Tree root history log to recover from many hardware
errors
• New fsck releases on the way to repair in place
– Chris Mason is talking on btrfs in L.A. on Saturday *cough*
• git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git recovery-beta
• Second Demo: btrfs filesystem recovery
13. Billions of Files?
• Dramatic differences in filesystem writeback patterns
• Sequential I/O still matters on modern SSDs
• Btrfs COW allows flexible writeback patterns
• Ext4 and XFS tend to get stuck behind their logs
– XFS has improved significantly
• Btrfs tends to produce more sequential writes and more
random reads
– Writeback regression in current kernels: we’re working on it!
14. File Creation Benchmark Summary
• Btrfs duplicates metadata by default
– 2x the writes
• Btrfs stores the file name three times
• Btrfs and XFS are CPU-bound on SSD
15. File Creation Throughput
16. IOPS
17. I/O Animations
• Ext4 is seeking between a large number of disk areas
• XFS is walking forward through a series of distinct areas
• Both XFS and Ext4 show heavy log activity
• Btrfs is doing sequential writes and some random reads
• http://oss.oracle.com/~mason/seekwatcher/
18. Root filesystem snapshots
yum-plugin-fs-snapshot
• Yum plugin to trigger a snapshot for all upgrades/installs
• Can be used as an instant rollback mechanism
• Currently supports btrfs snapshots
• Requires btrfs root
• Demo: convert / to btrfs and yum-plugin-fs-snapshot
19. Thank You!
• Avi Miller: avi.miller@oracle.com
• http://btrfs.wiki.kernel.org
• Oracle Linux 6.2
– http://oracle.com/linux
– http://edelivery.oracle.com/linux
• UEK2 Beta
– http://public-yum.oracle.com/beta/
– http://oss.oracle.com/git/linux-2.6-unbreakable-beta.git/
20. Q&A