Divide and conquer – Shared disk
cluster file systems shipped with the
              Linux kernel
             Udo Seidel
Shared file systems
●   Multiple servers access the same data
●   Different approaches
    ●   Network based, e.g. NFS, CIFS
    ●   Clustered
        –   Shared disk, e.g. CXFS, CFS, GFS(2), OCFS2
        –   Distributed parallel, e.g. Lustre, Ceph
History
●   GFS(2)
    ●   First version in the mid-1990s
    ●   Started on IRIX, later ported to Linux
    ●   Commercial background: Sistina and Red Hat
    ●   Part of the vanilla Linux kernel since 2.6.19
●   OCFS2
    ●   OCFS1 for database files only
    ●   First version in 2005
    ●   Part of the vanilla Linux kernel since 2.6.16
Features/Challenges/More
●   As similar as possible to local file systems
    ●   Internal setup
    ●   Management
●   Cluster awareness
    ●   Data integrity
    ●   Allocation
Framework
●   Bridges the gap between single-node and cluster operation
●   3 main components
    ●   Cluster-ware
    ●   Locking
    ●   Fencing
Framework GFS2 (I)
●   General-purpose cluster-ware
    ●   More flexible
    ●   More options/functions
    ●   More complexity
    ●   Configuration files in XML
●   Locking uses cluster framework too
●   system-config-cluster OR Conga OR vi & scp
Framework GFS2 (II)
# cat /etc/cluster/cluster.conf
<?xml version="1.0" ?>
<cluster config_version="3" name="gfs2">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="node0" nodeid="1" votes="1">
      <fence/>
    </clusternode>
    <clusternode name="node1" nodeid="2" votes="1">
    ...
</cluster>
#
Framework OCFS2 (I)
●   Cluster-ware just for OCFS2
    ●   Less flexible
    ●   Less options/functions
    ●   Less complexity
    ●   Configuration file in ASCII
●   Locking uses cluster framework too
●   ocfs2console OR vi & scp
Framework OCFS2 (II)
# cat /etc/ocfs2/cluster.conf
node:
      ip_port = 7777
      ip_address = 192.168.0.1
      number = 0
      name = node0
      cluster = ocfs2
...
cluster:
      node_count = 2
#
Locking
●   Distributed Lock Manager (DLM)
●   Based on VMS-DLM
●   Lock modes
    ●   Exclusive Lock (EX)
    ●   Protected Read (PR)
    ●   No Lock (NL)
    ●   Concurrent Write Lock (CW) – GFS2 only
    ●   Concurrent Read Lock (CR) – GFS2 only
    ●   Protected Write (PW) – GFS2 only
Locking - Compatibility
                 Existing Lock
Requested Lock   NL       CR     CW    PR    PW    EX
NL               Yes      Yes    Yes   Yes   Yes   Yes
CR               Yes      Yes    Yes   Yes   Yes   No
CW               Yes      Yes    Yes   No    No    No
PR               Yes      Yes    No    Yes   No    No
PW               Yes      Yes    No    No    No    No
EX               Yes      No     No    No    No    No
Fencing
●   Separation of host and storage
    ●   Power Fencing
        –   Power switch, e.g. APC
        –   Server side, e.g. IPMI, iLO
        –   Useful in other scenarios
        –   Post-mortem analysis more difficult
    ●   I/O fencing
        –   SAN switch, e.g. Brocade, Qlogic
        –   Possible to investigate “unhealthy” server
Fencing - GFS2
●   Both fencing methods
●   Part of the cluster configuration (see the excerpt below)
●   Cascading possible
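A hedged sketch of how a fence device could be wired into the GFS2 cluster configuration shown earlier; fence_ipmilan is the stock IPMI fence agent, and all names, addresses and credentials below are made-up placeholders.
# excerpt from /etc/cluster/cluster.conf (illustrative values only)
<clusternode name="node0" nodeid="1" votes="1">
  <fence>
    <method name="1">
      <device name="ipmi-node0"/>
    </method>
  </fence>
</clusternode>
...
<fencedevices>
  <fencedevice agent="fence_ipmilan" name="ipmi-node0"
               ipaddr="192.168.0.10" login="admin" passwd="secret"/>
</fencedevices>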
Fencing - OCFS2
●   Only power fencing
    ●   Self fencing only: the node resets itself
GFS2 – Internals (I)
●   Superblock
    ●   Starts at block 128
    ●   Expected data + cluster information
    ●   Pointers to master and root directory
●   Resource groups
    ●   Comparable to cylinder groups of traditional Unix
        file system
    ●   Allocatable from different cluster nodes -> defines the locking granularity
GFS2 – Internals (II)
●   Master directory
    ●   Contains meta-data, e.g. journal index, quota, ...
    ●   Not visible to ls and friends
    ●   Holds file-system-wide and cluster-node-specific files
●   Journaling file system
    ●   One journal per cluster node
    ●   Each journal accessible by all nodes (recovery)
GFS2 – Internals (III)
●   Inode/Dinode
    ●   Usual information, e.g. owner, mode, time stamp
    ●   Pointers to blocks: either data blocks or further pointer blocks
    ●   Only one level of indirection
    ●   “Stuffing”: small files stored directly in the inode block
●   Directory management via Extendible Hashing
●   Meta file statfs
    ●   statfs()
    ●   Tuning via sysfs
GFS2 – Internals (IV)
●   Meta files
    ●   jindex directory containing the journals
        –   journalX
    ●   rindex Resource group index
    ●   quota
    ●   per_node directory containing node specific files
GFS2 – what else
●   Extended attributes xattr
●   ACLs
●   Local mode = single-node access
OCFS2 – Internals (I)
●   Superblock
    ●   Starts at block 3 (1+2 for OCFS1)
    ●   Expected data + cluster information
    ●   Pointers to master and root directory
    ●   Up to 6 backups
        –   at pre-defined offset
        –   at 2^n Gbyte, n=0,2,4,6,8,10
●   Cluster groups
    ●   Comparable to cylinder groups of traditional Unix
        file system
OCFS2 – Internals (II)
●   Master or system directory
    ●   Contains meta-data, e.g. journal index, quota, ...
    ●   Not visible to ls and friends
    ●   Holds file-system-wide and cluster-node-specific files
●   Journaling file system
    ●   One journal per cluster node
    ●   Each journal accessible by all nodes (recovery)
OCFS2 – Internals (III)
●   Inode
    ●   Usual information, e.g. owner, mode, time stamp
    ●   Pointers to blocks: either data blocks or further pointer blocks
    ●   Only one level of indirection
●   global_inode_alloc
    ●   Global meta data file
    ●   inode_alloc is the node-specific counterpart
●   slot_map
    ●   Global meta data file
    ●   Active cluster nodes
OCFS2 – Internals (IV)
●   orphan_dir
    ●   Local meta data file
    ●   Cluster-aware deletion of files still in use
●   truncate_log
    ●   Local meta data file
    ●   Deletion cache
OCFS2 – what else
●   Two versions: 1.2 and 1.4
    ●   Mount compatible
    ●   Cluster frameworks not network compatible (no mixed 1.2/1.4 clusters)
    ●   New features disabled by default
●   For 1.4:
    ●   Extended attributes xattr
    ●   Inode-based snapshotting
    ●   Preallocation
File system management
●   Known/expected tools + cluster details
    ●   mkfs
    ●   mount/umount
    ●   fsck
●   File system specific tools
    ●   gfs2_XXXX
    ●   tunefs.ocfs2, debugfs.ocfs2
GFS2 management (I)
●   File system creation needs additional
    information
    ●   Cluster name
    ●   Unique file system identifier (string)
    ●   Optional:
        –   Locking mode to be used
        –   Number of journals
    ●   Tuning by changing default sizes for journals, resource groups, ... (example below)
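A minimal creation sketch, assuming the cluster name gfs2 from the configuration example above, a made-up file system name data, two journals and a placeholder LVM volume; -p picks the locking protocol, -t the cluster:fsname lock table and -j the number of journals.
# mkfs.gfs2 -p lock_dlm -t gfs2:data -j 2 /dev/vg0/lv0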
GFS2 management (II)
●   Mount/umount
    ●   No real syntax surprises
    ●   First node to mount checks all journals
    ●   Enabling ACLs, quota, single-node mode (examples below)
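Illustrative mount calls with placeholder device and mount point: the first enables ACLs and quota via standard GFS2 mount options, the second forces single-node mode by overriding the locking protocol with lock_nolock.
# mount -t gfs2 -o acl,quota=on /dev/vg0/lv0 /mnt/data
# umount /mnt/data
# mount -t gfs2 -o lockproto=lock_nolock /dev/vg0/lv0 /mnt/data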
GFS2 management (III)
●   File system check (example below)
    ●   Journal recovery of node X by node Y
    ●   Done by one node
    ●   File system offline everywhere else
    ●   Known phases
        –   Journals
        –   Meta data
        –   References: data blocks, inodes
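A hedged check run against the placeholder volume; the file system has to be unmounted on every node first, and -y answers all repair questions automatically.
# fsck.gfs2 -y /dev/vg0/lv0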
GFS2 tuning (I)
●   gfs2_tool
    ●   Most powerful
        –   Display superblock
        –   Change superblock settings (locking mode, cluster name)
        –   List meta data
        –   freeze/unfreeze file system
        –   Special attributes, e.g. appendonly, noatime
    ●   Mostly requires the file system to be online (examples below)
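A few illustrative gfs2_tool calls against the placeholder device and mount point used above: dump the superblock, rename the lock table, freeze and unfreeze the file system, and mark a file append-only; exact subcommand spellings can differ between gfs2-utils versions.
# gfs2_tool sb /dev/vg0/lv0 all
# gfs2_tool sb /dev/vg0/lv0 table gfs2:data2
# gfs2_tool freeze /mnt/data
# gfs2_tool unfreeze /mnt/data
# gfs2_tool setflag appendonly /mnt/data/logfile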
GFS2 tuning (II)
●   gfs2_edit
    ●   Logical extension of gfs2_tool
    ●   More details, e.g. node-specific meta data, block
        level
●   gfs2_jadd
    ●   Different sizes possible
    ●   No deletion possible
    ●   Can cause data space shortage (example below)
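An illustrative journal addition on the mounted placeholder file system: -j adds one journal, -J sets its size in megabytes.
# gfs2_jadd -j 1 -J 128 /mnt/data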
GFS2 tuning (III)
●   gfs2_grow
    ●   Needs free space in the meta directory
    ●   Online only
    ●   No shrinking (example below)
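A hedged growth sequence for an LVM-backed volume: extend the logical volume first, then let gfs2_grow claim the new space on the mounted file system.
# lvextend -L +20G /dev/vg0/lv0
# gfs2_grow /mnt/data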
OCFS2 management (I)
●   File system creation
    ●   No additional information needed
    ●   Tuning by optional parameters
●   Mount/umount
    ●   No real syntax surprises
    ●   First node to mount checks all journals
    ●   Enabling ACLs, quota, single-node mode (examples below)
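A minimal sketch with a placeholder multipath device, label and mount point; -L sets the file system label and -N the number of node slots, i.e. journals.
# mkfs.ocfs2 -L data -N 4 /dev/mapper/mpath1
# mount -t ocfs2 /dev/mapper/mpath1 /mnt/data
# umount /mnt/data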
OCFS2 management (II)
●   File system check
    ●   Journal recovery of node X by node Y
    ●   Done by one node
    ●   File system offline everywhere else
    ●   Fixed offsets of the superblock backups come in handy (example below)
    ●   Known phases
        –   Journals
        –   Meta data
        –   References: data blocks, inodes
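Illustrative checks on the placeholder device: -f forces a full check, -y answers all questions, and -r falls back to one of the numbered backup superblocks mentioned earlier if the primary one is damaged.
# fsck.ocfs2 -fy /dev/mapper/mpath1
# fsck.ocfs2 -r 2 /dev/mapper/mpath1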
OCFS2 tuning (I)
●   tunefs.ocfs2 (examples below)
    ●   Display/change file system label
    ●   Display/change number of journals
    ●   Change journal setup, e.g. size
    ●   Grow file system (no shrinking)
    ●   Create backup of superblock
    ●   Display/enable/disable specific file system features
        –   Sparse files
        –   “stuffed” inodes
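A few hedged tunefs.ocfs2 calls against the placeholder device: change the label, raise the number of node slots, enlarge the journals and switch on the sparse-file feature; exact option spellings can vary between ocfs2-tools versions.
# tunefs.ocfs2 -L archive /dev/mapper/mpath1
# tunefs.ocfs2 -N 8 /dev/mapper/mpath1
# tunefs.ocfs2 -J size=128M /dev/mapper/mpath1
# tunefs.ocfs2 --fs-features=sparse /dev/mapper/mpath1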
OCFS2 tuning (II)
●   debugfs.ocfs2
    ●   Display file system settings, e.g. superblock
    ●   Display inode information
    ●   Access to the meta data files (example below)
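Illustrative read-only queries: -R runs a single request, stats dumps the superblock, and the system (master) directory with its meta data files is reachable via the // prefix.
# debugfs.ocfs2 -R "stats" /dev/mapper/mpath1
# debugfs.ocfs2 -R "ls //" /dev/mapper/mpath1
# debugfs.ocfs2 -R "stat //slot_map" /dev/mapper/mpath1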
Volume manager
●   Necessary to handle more than one
    LUN/partition
●   Cluster-aware
●   Bridges the feature gap, e.g. volume-based snapshots
●   CLVM (example below)
●   EVMS – OCFS2 only
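A hedged CLVM sketch: with clvmd running on all cluster nodes, -cy marks the volume group as clustered so that LVM metadata changes are coordinated across the cluster; the devices are placeholders.
# vgcreate -cy vg0 /dev/mapper/mpath1 /dev/mapper/mpath2
# lvcreate -L 200G -n lv0 vg0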
Key data - comparison
                             GFS2                              OCFS2
Maximum # of cluster nodes   16 supported (theoretical: 256)   256
Journaling                   Yes                               Yes
Cluster-less/local mode      Yes                               Yes
Maximum file system size     25 TB (theoretical: 8 EB)         16 TB (theoretical: 4 EB)
Maximum file size            25 TB (theoretical: 8 EB)         16 TB (theoretical: 4 EB)
POSIX ACLs                   Yes                               Yes
Grow-able                    Yes, online only                  Yes, online and offline
Shrinkable                   No                                No
Quota                        Yes                               Yes
O_DIRECT                     On file level                     Yes
Extended attributes          Yes                               Yes
Maximum file name length     255                               255
File system snapshots        No                                No
Summary
●   GFS2 longer history than OCFS2
●   OCFS2 setup simpler and easier to maintain
●   GFS2 setup more flexible and powerful
●   OCFS2 getting close to GFS2
●   Choice often depends on the Linux vendor
References
http://sourceware.org/cluster/gfs/
http://www.redhat.com/gfs/
http://oss.oracle.com/projects/ocfs2/
http://sources.redhat.com/cluster/wiki/
http://sourceware.org/lvm2/
http://evms.sourceforge.net/
Thank you!
