Linux and Open Source Software have always played a crucial role in data centers to provide storage in various ways. In this talk, Lenz will give an overview of how storage on Linux has evolved over the years, from local file systems to scalable file systems, logical volume managers and cluster file systems to today's modern file systems and distributed, parallel and fault-tolerant file systems.
The Evolution of Storage on Linux - FrOSCon - 2015-08-22
1. The Evolution of Storage
on Linux
Lenz Grimmer <lenz.grimmer@it-novum.com>
FrOSCon 2015, Sankt Augustin
22 August 2015
2. 2
Agenda
A trip down memory lane (pun intended)
Overview of how storage on Linux has evolved
Local file systems and related concepts/technologies
Network Services
Distributed / Cluster filesystems
3. 3
Introduction
40+ file systems in /fs/
Focus on the most popular/widely used systems
Primary focus on the software side
High-level Descriptions only
4. 4
Noteworthy Observations / Conclusions
The role of SourceForge.net today
Distribution kernels vs. mainline Linux
Honorable mention: Christoph Hellwig
Don't miss his talk about the Linux Storage Stack tomorrow (14:00, HS6)
Big Thanks to: LWN, Kernelnewbies.org, Thorsten Leemhuis
(Heise) and Wikipedia
6. 6
MINIX file system
While developing Linux in 1991, Linus required some form of
persistent storage
A Minix-compatible file system was the canonical choice:
Well-documented, robust
Exchange data with the host OS (and vice versa)
Severely limited
Max. file/filesystem size: 64MB (16bit block addresses)
14 char file names
Only one time stamp (mtime)
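The 64 MB limit follows directly from the address width. A quick check, assuming the 1 KiB block size of the classic Minix file system (the block size is not stated on the slide):

```python
# Assumption: 1 KiB blocks, as used by the classic Minix file system.
BLOCK_SIZE = 1024      # bytes per block
ADDRESS_BITS = 16      # width of a block address

max_blocks = 2 ** ADDRESS_BITS          # number of addressable blocks
max_bytes = max_blocks * BLOCK_SIZE     # largest addressable file system

print(max_bytes // (1024 * 1024), "MiB")
```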
7. 7
Virtual File System Switch (VFS)
Abstraction / indirection layer to route file oriented system calls to
necessary functions in the physical filesystem code to do the I/O
Eased the addition of new file systems
Initially written by Chris Provenzano
Integrated into Linux 0.96
Defines a set of functions that every filesystem has to implement
Three kinds of objects: filesystems, inodes, and open files
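The indirection idea can be sketched in a few lines. This is a toy model, not the kernel's interface: the class names (FileSystem, MemFS, VFS) and the two-method interface are hypothetical, and only the "filesystems" object kind is modelled (no inodes or open files).

```python
from __future__ import annotations
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Functions every file system has to implement (simplified, hypothetical)."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class MemFS(FileSystem):
    """Toy in-memory file system, used only for this illustration."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self._files[path]

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data


class VFS:
    """Routes file-oriented calls to the file system mounted at the path."""

    def __init__(self) -> None:
        self._mounts: dict[str, FileSystem] = {}

    def mount(self, mount_point: str, fs: FileSystem) -> None:
        self._mounts[mount_point.rstrip("/") or "/"] = fs

    def _resolve(self, path: str) -> tuple[FileSystem, str]:
        # Choose the longest mount point that is a prefix of the path.
        best = None
        for mp in self._mounts:
            prefix = mp if mp.endswith("/") else mp + "/"
            if path == mp or path.startswith(prefix):
                if best is None or len(mp) > len(best):
                    best = mp
        if best is None:
            raise FileNotFoundError(path)
        relative = path[len(best):] or "/"
        if not relative.startswith("/"):
            relative = "/" + relative
        return self._mounts[best], relative

    def read(self, path: str) -> bytes:
        fs, relative = self._resolve(path)
        return fs.read(relative)

    def write(self, path: str, data: bytes) -> None:
        fs, relative = self._resolve(path)
        fs.write(relative, data)
```

Adding a new file system then only means writing another FileSystem subclass and mounting it; callers of VFS.read and VFS.write do not change.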
8. 8
Extended File System (ext)
Designed by Rémy Card
Max. file/filesystem size: 2 GB, max. file name size was 255 chars
Metadata structure inspired by the traditional Unix File System
(UFS)
Added to Linux 0.96c in April 1992
Issues remained (bad performance, missing time stamps,
fragmentation)
9. 9
Second Extended File System (ext2)
Also implemented by Rémy Card
Introduced in Linux Kernel 0.99 (January 1993)
Designed with extensibility in mind
Adopted advanced ideas from other file systems (e.g. BSD Fast File System),
e.g. mtime/ctime/atime, file attributes, BSD/SysV semantics, different block
sizes, immutable/append-only files
Initially supported file/file systems sizes up to 2TB (limitation of the block
device layer)
Kernel version 2.6.17 (March 2006) extended max. file system size to 32TB
(using 8kB Blocks)
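The 32TB figure is the product of the block size and the number of addressable blocks. A small sketch, assuming ext2's 32-bit block numbers (not stated on the slide) and ignoring any lower limits imposed by the block device layer:

```python
def max_fs_size(block_size, address_bits=32):
    """Largest file system addressable with the given block size, in bytes."""
    return (2 ** address_bits) * block_size


TIB = 1024 ** 4

# Block sizes in KiB; 8 KiB is the case mentioned on the slide.
for kib in (1, 2, 4, 8):
    size = max_fs_size(kib * 1024)
    print(f"{kib} KiB blocks -> {size // TIB} TiB")
```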
10. 10
FAT/MSDOS
Added to Linux in 1992/1993 by Werner Almesberger
VFAT support was later developed by Gordon Chaffee
VFAT filesystem is compatible with Windows 95/NT long filenames on the
FAT filesystem
Initially called xmsdos
Patches for Linux 1.2.x and 1.3.x.
As of Linux 1.3.60, the vfat filesystem is part of the Linux kernel distribution
Mtools as a userland-only alternative
11. 11
NTFS
NTFS driver for Linux by Martin von Löwis (started around 1996)
From June 2001, Legato Systems sponsored Anton Altaparmakov to
further develop NTFS on Linux
Read-only mode only, with no fault-tolerance supported
NTFS-TNG replaced old NTFS driver in Linux 2.5.11 (April 29th,
2002)
NTFS-3G (FUSE-based) by Tuxera (read-write support)
13. 13
Fsck vs. Journaling
Unclean unmounts, too many mount counts, or remounts after
a long time period triggered file system checks
Disk drives got bigger
A Journaling file system keeps track of changes not yet
committed to the file system's main part in a Journal
Keep track of just metadata changes or data as well
Several file systems were developed in parallel, to alleviate this
shortcoming of ext2, namely ext3, XFS, JFS and ReiserFS.
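Why a journal removes the need for a full check after a crash can be shown with a toy model. This is a minimal sketch, not how ext3, XFS, JFS or ReiserFS lay out their journals; the class and method names are hypothetical.

```python
class JournaledStore:
    """Toy key/value 'file system' with a write-ahead journal."""

    def __init__(self):
        self.main = {}      # the file system's main part
        self.journal = []   # pending transactions

    def begin(self, ops):
        """Record the intended changes in the journal before touching main."""
        txn = {"ops": list(ops), "committed": False}
        self.journal.append(txn)
        return txn

    def commit(self, txn):
        """Mark the transaction as complete in the journal."""
        txn["committed"] = True

    def checkpoint(self):
        """Apply committed transactions to main and drop them from the journal."""
        for txn in self.journal:
            if txn["committed"]:
                for key, value in txn["ops"]:
                    self.main[key] = value
        self.journal = [t for t in self.journal if not t["committed"]]

    def recover(self):
        """After a crash: replay committed transactions, discard incomplete ones."""
        self.checkpoint()
        self.journal.clear()


store = JournaledStore()
done = store.begin([("inode 1", "size=10")])
store.commit(done)
store.begin([("inode 2", "size=20")])   # "crash" before this one is committed
store.recover()
print(store.main)
```

Recovery only has to read the journal, so its cost depends on the journal size rather than on the size of the disk.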
14. 14
Journaling Block Device layer (JBD)
JBD established as a filesystem-independent service, to be used
by any file system
First incarnation of JBD developed by Stephen C. Tweedie
together with the ext3 file system
OCFS2 and later ext4 also used JBD and its successor JBD2
15. 15
Third extended filesystem (ext3)
Originally released in September 1999
Written by Stephen Tweedie for the 2.2 branch
Ported to 2.4 kernels by Peter Braam, Andreas Dilger, Andrew
Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie
Merged with the mainline Linux kernel 2.4.15 (November 2001)
Basically ext2 with journaling capabilities, easy conversion
Max filesystem size: 8TB, Max 32k subdirs/directory
16. 16
IBM JFS
Rooted in AIX and OS/2 Warp Server (new design in 1995)
Port to Linux started in December 1999 (Dave Kleikamp, Steve Best)
Uses own journaling implementation (metadata only)
Max volume size: 32PB, Max file size: 4PB
Later ported to AIX 5L as JFS2 (April 2001)
JFS 0.0.1 released in Feb. 2000, 0.1.0 (Beta) in August 2000
Version 1.0.0 was released in June 2001
Kernel module since 2.4.18pre9-ac4, Version 1.1.0 was included by Marcelo
Tosatti in Linux 2.4.20.
17. 17
ReiserFS
Supported early on by SuSE; introduced in Linux 2.4.1 (2001)
The first journaling file system to be included in mainline
Max volume size: 16TB
Based on B+ trees
Metadata-only journaling (block journaling since 2.6.8)
Online resizing
Tail packing block suballocation
Reiser4 still under active development (Edward Shishkin)
18. 18
SGI XFS
64-bit journaling file system created by Silicon Graphics
SGI IRIX since 1994, GPLed in 2000
Version 1.0 for Linux in May 2001 as Patch against 2.4.2
Merged in 2.6.x and 2.4.25 (Feb 2004)
Steve Lord, Russell Cattelan, Nathan Scott, Jim Mostek
Advanced features, high performance
Max volume size: 16EB
20. 20
The need for Logical Volume Management
Initially, Linux could only address disks/partitions
Changes to the layout required downtime and shuffling of data
Logical Volume Management abstracts physical disk drives
First incarnation of Linux LVM was introduced in Kernel version
2.4
Heinz Mauelshagen wrote the original LVM code in 1998,
inspired by HP-UX's volume manager.
21. 21
Device Mapper (DM)
A kernel framework for mapping physical block devices onto higher-
level virtual block devices
Added in Linux 2.6
Passes data from a virtual block device, which is provided by the
device mapper itself, to another block device
Pluggable design
Data can also be modified in transit
Forms the foundation of LVM2/EVMS, RAID and dm-crypt disk
encryption and many other useful features
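The core of the mapping idea, sketched for the simplest case (a linear mapping that concatenates two devices). The table layout and device names below are made up for illustration and do not reproduce the kernel's data structures:

```python
from bisect import bisect_right

# Hypothetical mapping table:
# (virtual_start_sector, length, backing_device, backing_start_sector)
TABLE = [
    (0,    1000, "/dev/sda1", 0),
    (1000, 500,  "/dev/sdb1", 2048),
]


def map_sector(table, sector):
    """Translate a sector on the virtual device to (backing device, sector)."""
    starts = [row[0] for row in table]
    index = bisect_right(starts, sector) - 1
    if index < 0:
        raise ValueError(f"sector {sector} is not mapped")
    start, length, device, offset = table[index]
    if sector >= start + length:
        raise ValueError(f"sector {sector} is not mapped")
    return device, offset + (sector - start)


print(map_sector(TABLE, 10))
print(map_sector(TABLE, 1200))
```

Other targets replace the translation step with something else, for example encrypting the data on the way through.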
22. 22
DM Multipath (DM-MPIO)
Consists of kernel components and user-space components
Provides input-output (I/O) fail-over and load-balancing within Linux
for block devices
Handles the rerouting of block I/O to an alternate path in the event of
a path failure
Can also balance the I/O load across all of the available paths in Fibre
Channel (FC) or iSCSI SAN environments
Started as part of a patchset created by Joe Thornber, later
maintained by Alasdair G Kergon at Red Hat. Christophe Varoqui
maintains the userland multipath tools
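Fail-over and load balancing can be sketched as a path selector that rotates over the healthy paths. This is an illustrative model with hypothetical names, assuming a simple round-robin policy; it is not the DM-MPIO code:

```python
class MultipathDevice:
    """Toy model of one block device reachable over several paths."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.failed = set()
        self._next = 0

    def fail_path(self, path):
        self.failed.add(path)

    def reinstate_path(self, path):
        self.failed.discard(path)

    def select_path(self):
        """Return the next healthy path in round-robin order."""
        for _ in range(len(self.paths)):
            path = self.paths[self._next]
            self._next = (self._next + 1) % len(self.paths)
            if path not in self.failed:
                return path
        raise OSError("all paths to the device have failed")


mp = MultipathDevice(["sdb", "sdc"])       # hypothetical path names
print([mp.select_path() for _ in range(3)])
mp.fail_path("sdb")                        # I/O is rerouted to the other path
print([mp.select_path() for _ in range(2)])
```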
23. 23
DM-Cache
Allows a fast device (e.g. an SSD) to be used as a cache for a slower device
(e.g. a rotating disk)
Different policy plugins can be used to change the algorithms used to select
which blocks are promoted, demoted, cleaned etc.
Supports writeback and writethrough modes
Requires three physical storage devices to separately store actual data,
cache data and required metadata
Joe Thornber, Heinz Mauelshagen and Mike Snitzer
Inclusion into the Linux mainline kernel version 3.9, released on April 28,
2013
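The difference between the two write modes, as a toy sketch. The class is hypothetical, it keeps everything in dictionaries, and it leaves out the policy plugins and the metadata device:

```python
class CachedDevice:
    """Toy model of a slow origin device fronted by a fast cache."""

    def __init__(self, mode="writethrough"):
        if mode not in ("writethrough", "writeback"):
            raise ValueError(mode)
        self.mode = mode
        self.origin = {}    # slow device: block number -> data
        self.cache = {}     # fast device: block number -> data
        self.dirty = set()  # cached blocks not yet written to the origin

    def write(self, block, data):
        self.cache[block] = data
        if self.mode == "writethrough":
            self.origin[block] = data   # origin is always up to date
        else:
            self.dirty.add(block)       # origin is updated later

    def read(self, block):
        if block in self.cache:
            return self.cache[block]
        data = self.origin[block]
        self.cache[block] = data        # promote the block into the cache
        return data

    def flush(self):
        """Clean all dirty blocks by copying them to the origin."""
        for block in sorted(self.dirty):
            self.origin[block] = self.cache[block]
        self.dirty.clear()
```

In writeback mode a write completes as soon as the fast device has it, at the cost of dirty blocks that exist only in the cache until they are cleaned.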
24. 24
LVM2
Based on DM
Flexible storage management
Add/remove disks
Resize/move logical volumes
Move LVs between PVs
Span volumes across multiple physical devices
RAID
Thin provisioning
Cluster Volume Manager
25. 25
IBM EVMS
IBM-sponsored effort to provide volume management services for
Linux
A single, unified system for handling all storage management tasks
Despite many of the features and GUI management tools found in
EVMS, LVM2 was preferred
As a result, IBM dropped their kernel driver and reworked their tools
to work with LVM2 instead
Development stopped in 2006
27. 27
NFS
Rick Sladkey was the original author of the NFS client and also ported the NFS
server and the RPC library code. Doug Quale helped extend the kernel to
support networked file systems
NFS Version 2 since 1.2 kernel series
Kernel 2.2.18 was a major milestone: interoperability between Linux NFS and other
operating systems' NFS, reliable file locking over NFS, and NFS Version 3.
NFS Versions 2, 3, and 4 are supported on 2.6 and later kernels. The Version 4.1
client requires at least kernel 2.6.31
NFSv4 for Linux has been under development at CITI and NetApp since 2001
28. 28
Samba
A free-software re-implementation of the SMB/CIFS networking protocol
Andrew Tridgell started development of Samba in 1992, Jeremy Allison
joined early on
Volker Lendecke founded SerNet in 1997, to provide commercial support
Version 3 (2003): provides file and print services for Microsoft Windows clients
and can integrate with a Windows NT 4.0 server domain, either as a Primary
Domain Controller (PDC) or as a domain member
Samba4 installations can act as an Active Directory domain controller or
member server, at Windows 2008 domain and forest functional levels.
29. 29
SMB vs. CIFS
SMB ("Server Message Block") and CIFS ("Common Internet File System")
are protocols. CIFS is an extension of the SMB protocol
“smbfs” was an older file system that originated from the Samba project,
heavily coupled with the Samba tools (smb.conf, smbmount, etc.). Removed
in Linux 2.6.27
CIFS VFS was added to mainline Linux kernels in 2.5.42. Supports
advanced network file system features such as locking, Unicode
(advanced internationalization), hardlinks, dfs (hierarchical,
replicated name space), distributed caching and uses native TCP
names. All key network functions implemented in kernel
31. 31
Fourth Extended Filesystem (ext4)
Advanced version of ext3, led by Ted Ts'o et al.
Incorporated scalability and reliability enhancements for supporting
large filesystems up to 1EB.
First experimental support for ext4 was merged into Linux 2.6.19,
which was released on 29 November 2006.
Ext4 was marked as experimental until Linux 2.6.27
Starting with 2.6.28 (December 2008), ext4 was marked as stable
New extent format reduced metadata overhead (RAM, IO for access,
transactions)
32. 32
Btrfs
Chris Mason (Oracle) in 2007
COW (Snapshots)
Checksums, Compression
RAID, Volume management
Conversion of ext3/4 file systems
Merged into mainline Linux 2.6.29 (March 2009)
Florian Winkler talks about Btrfs today (11:15, HS7)
33. 33
ZFS
Filesystem and logical volume manager combined
Designed and implemented at Sun Microsystems (Jeff Bonwick, Matthew
Ahrens)
Development started in 2001, officially announced in 2004
128bit, COW, Snapshots, Deduplication, RAID
OpenSolaris (CDDL)
Early port based on FUSE
Kernel modules based on OpenZFS (2013)
Not included in mainline Linux due to license incompatibilities
35. 35
Network Block Device (NBD)
Remotely access a block device attached to another system
Userspace Server/Client, Client kernel module
Issues arise if network goes down or server crashes
Markus Pargmann talks about NBD on Sunday (16:30, HS6)
36. 36
Distributed Replicated Block Device (DRBD)
A shared-nothing, synchronously replicated block device
“RAID1 over Network”
Writes to the primary node are transferred to the lower-level block device and
simultaneously propagated to the secondary node
The secondary node then transfers data to its corresponding lower-level block
device. All read I/O is performed locally
Fail-over capabilities (Secondary/Primary)
Lars Ellenberg and Philipp Reisner originally submitted code in July 2007
DRBD was merged on 8 December 2009 during the "merge window" for Linux
kernel version 2.6.33
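The write and read paths described above can be modelled in a few lines. This is a toy sketch with hypothetical class names; the "network" is a plain method call and error handling is left out:

```python
class Node:
    """One cluster node with its own lower-level block device."""

    def __init__(self, name):
        self.name = name
        self.disk = {}          # block number -> data
        self.role = "secondary"


class ReplicatedDevice:
    """Toy model of a synchronously replicated block device."""

    def __init__(self, primary, secondary):
        self.primary, self.secondary = primary, secondary
        self.primary.role = "primary"
        self.secondary.role = "secondary"

    def write(self, block, data):
        self.primary.disk[block] = data     # local lower-level device
        self.secondary.disk[block] = data   # sent over the network in reality
        return True                         # acknowledged once both copies exist

    def read(self, block):
        return self.primary.disk[block]     # all reads are served locally

    def failover(self):
        """Promote the secondary node, demote the former primary."""
        self.primary, self.secondary = self.secondary, self.primary
        self.primary.role = "primary"
        self.secondary.role = "secondary"


alice, bob = Node("alice"), Node("bob")     # hypothetical node names
device = ReplicatedDevice(alice, bob)
device.write(0, b"data")
device.failover()
print(device.primary.name, device.read(0))
```

Because every acknowledged write already exists on both nodes, the promoted node can serve the same data immediately.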
38. 38
OCFS/OCFS2
Shared disk file system by Oracle
Main focus of OCFS was to accommodate Oracle clustered databases,
not POSIX-compliant
OCFS2 designed as a Linux filesystem from scratch
On-disk filesystem implementation heavily inspired by ext3, uses JBD
for journaling
OCFS2 integrated into version 2.6.16 of mainline Linux
Max Volume/File Size 4PB (currently limited to 16TB)
Trivia question: what feature do OCFS2 and Btrfs have in common?
39. 39
GFS/GFS2
Shared disk filesystem, allows concurrent access to the same block storage
Development of GFS began in 1995, led by University of Minnesota
professor Matthew O'Keefe and a group of students
Originally for SGI IRIX, ported to Linux in 1998
Acquired by Sistina in 2000, turned into proprietary product
OpenGFS fork
Red Hat acquired Sistina in 2003 and released GFS2 under GPL in June 2004
GFS2 and the DLM merged into Linux 2.6.19 (29 November 2006)
40. 40
Storage Requirements and Challenges
Amount of data to be stored grows exponentially
Today, Storage has to be:
Fault tolerant, reliable
Scalable without limitations or service interruptions
Distributable
Easy to manage / automate
Previous approaches do not address these requirements
42. 42
GlusterFS
Aggregates various storage servers over Ethernet or Infiniband RDMA
interconnect into one large parallel network file system
Storage bricks export local file systems as volumes
GlusterFS clients create composite virtual volumes from multiple remote
servers using stackable "translators"
Translators provide Mirroring, Replication, Striping, etc.
Final volume mounted by client host using its own native protocol via FUSE,
or using the NFS v3 protocol (via built-in server translator)
Originally developed by Gluster, Inc., which was acquired by Red Hat in 2011
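How stackable translators compose a volume out of bricks, as a toy sketch. All class names are hypothetical, and CRC32 is used here only as a stand-in hash for placing files; it is not the hash GlusterFS uses:

```python
import zlib


class Brick:
    """A storage server exporting a local file system."""

    def __init__(self, name):
        self.name = name
        self.files = {}

    def write(self, path, data):
        self.files[path] = data

    def read(self, path):
        return self.files[path]


class Replicate:
    """Translator that mirrors every write to all of its children."""

    def __init__(self, children):
        self.children = list(children)

    def write(self, path, data):
        for child in self.children:
            child.write(path, data)

    def read(self, path):
        return self.children[0].read(path)


class Distribute:
    """Translator that places each file on one child, chosen by a hash."""

    def __init__(self, children):
        self.children = list(children)

    def _pick(self, path):
        # Stand-in hash (assumption), deterministic across runs.
        return self.children[zlib.crc32(path.encode()) % len(self.children)]

    def write(self, path, data):
        self._pick(path).write(path, data)

    def read(self, path):
        return self._pick(path).read(path)


bricks = [Brick(f"server{i}") for i in range(4)]
volume = Distribute([Replicate(bricks[0:2]), Replicate(bricks[2:4])])
volume.write("/report.txt", b"hello")
print(volume.read("/report.txt"))
```

Since every translator exposes the same read/write interface as a brick, translators can be stacked in any order.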
43. 43
Ceph
Initially created by Sage Weil, who founded Inktank in 2012
First release in July 2012
Object, block, and file storage from a single distributed computer cluster
Reliable autonomic distributed object store (RADOS)
RADOS Block Device (RBD), Snapshots
RadosGW provides REST API (Amazon S3/OpenStack Swift)
Completely distributed without a single point of failure
Replicates data for fault tolerance (CRUSH)
Ceph client code was merged into mainline Linux version 2.6.34
Red Hat acquired Inktank in April 2014
44. 44
Lustre
Parallel distributed file system, generally used for large-scale cluster computing
Widely used in TOP500 supercomputers
Max. volume size: 100 PB (production), over 16 EB (theoretical)
Max. file size: 2.5 PB (ext4), 16 EB (ZFS)
Started as a research project in 1999 by Peter Braam at CMU, who founded Cluster Filesystems Inc. in
2001 to work on Intermezzo, Coda and Lustre
First installed in March 2003 on the MCR Linux Cluster (Lawrence Livermore National Laboratory).
Lustre 1.0.0 was released in December 2003.
Acquired by Sun Microsystems in 2007
Oracle acquired Sun in 2010 and discontinued the development
Whamcloud->Intel, Open Scalable File Systems, Inc. (OpenSFS), Xyratex Inc.
45. 45
Shameless plug: openATTIC
Unified Storage: manage XFS, ZFS, Btrfs, NFS, Samba
Modern GUI (AngularJS/Bootstrap)
REST API
Built-in Monitoring
Clustering (Pacemaker/Corosync, DRBD)
http://www.openattic.org/
Find us in the exhibition hall
46. 46
PHP DEVELOPER (M/F) with
Linux know-how
Are you passionate about developing software and do you feel at home in the
open source world?
Then we should get to know each other!
These tasks await you with us…
• Development of our system monitoring tool
openITCOCKPIT for the frontend and/or backend
• Design and implementation of projects as part of a
team
• Testing of the developed applications
• Maintenance and expansion of the existing development and
test environment
More information is available at:
www.it-novum.com/karriere