7. Use Case
Small number of files (30~100 per VM)
Files are either very small (a few KB) or very
large (many GB)
SAN storage is the underlying substrate.
All storage exported by these storage systems
is shared among all ESX servers
8. Design Goals
Metadata overhead should be very low
VM IO throughput and latency should be as
good as on a directly attached raw device
A clustered lock manager for moderating
access to files among ESX servers
Help VMs react deterministically to transient
and non-transient SAN events and error
conditions.
10. VMFS Architecture
A volume is an aggregation of resources and on-disk
locks.
A resource is either an inode, a file block, a sub-
block or an indirect block.
Each lock moderates access to a subset of resources.
Hosts negotiate access to resources by acquiring
the relevant locks.
VMFS = a clustered lock manager + a resource
manager + a journaling module + a data mover + a
VM IO manager + a POSIX system call frontend
11. VMKernel Logical Volume
VMFS volumes are by default created inside VMKernel
logical volumes. A VMKernel logical volume can
span multiple devices.
13. Four Resources
file blocks
sub-blocks
pointer blocks
file descriptors
Resources are grouped together into collections called
CLUSTERs and clusters are further grouped together
into CLUSTER GROUPS.
14. Block Mapping
Packed inside inode
Sub block addressing
File block addressing
Pointer block addressing
Can upgrade automatically.
15. System Files
System files are created at file system format
time, and each manages one type of
resource.
16. System Files
Use file blocks.
Same read/write method as regular files.
Checking file data consistency essentially
provides metadata consistency.
17. Cluster Groups
Cluster groups are repeated to create a file system.
An existing VMFS volume grows over unused space
on the disk or spans new disks by laying out new
cluster groups that refer to the newly added space.
The VMFS resource manager makes hosts operate on
different and distant cluster groups within a system
file. This reduces the possibility of multiple hosts
contending on the same lock(s) and increases the
efficiency of the clustered lock manager.
18. On-disk Lock
A single sector data
structure.
Locking is based on lease.
Atomic disk operations (SCSI
reserve-read-modify-write-
SCSI release)
19. On-disk Lock Data Structure
HostID: This is a 128-bit unique identifier that identifies the ESX host that
owns the lock at a given point in time. All zeros means no owner.
Mode: A set of non-zero values to indicate whether a lock is free, held
exclusively, held by multiple hosts for shared read access, or held by
multiple hosts for shared read and write access.
Generation: A monotonically increasing counter, updated every time a lock
is acquired, released or broken. While the hostID field sufficiently
disambiguates operations on a lock from different hosts, this field
disambiguates multiple operations on a lock by the same host.
HBregion: For each valid hostID (if any) currently using the lock, a pointer
to the on disk heartbeat region of the host.
HBgen: A generation number to validate the HBregion reference as being
current or stale. It disambiguates locks held by a given host before and
after a host crash and before and after a storage outage.
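The fields above can be sketched as a record that serializes into one sector. This is only an illustration: the 512-byte sector size, the field widths other than the 128-bit hostID, the numeric mode values, and the names OnDiskLock, to_sector and from_sector are all assumptions, not the actual VMFS on-disk format. It also models a single owner, whereas the slide allows one HBregion pointer per host sharing the lock.

```python
import struct
from dataclasses import dataclass

SECTOR_SIZE = 512  # assumed sector size in bytes

# Hypothetical mode encoding; the slide only says the values are non-zero.
MODE_FREE, MODE_EXCLUSIVE, MODE_SHARED_READ, MODE_SHARED_RW = 1, 2, 3, 4

# Assumed layout: 16-byte hostID, 4-byte mode, 8-byte generation,
# 8-byte heartbeat-region offset, 8-byte heartbeat generation.
_LAYOUT = struct.Struct("<16sIQQQ")


@dataclass
class OnDiskLock:
    host_id: bytes = bytes(16)  # all zeros means no owner
    mode: int = MODE_FREE
    generation: int = 0         # bumped on acquire, release or break
    hb_region: int = 0          # offset of the owner's heartbeat slot
    hb_gen: int = 0             # validates hb_region as current or stale

    def is_free(self) -> bool:
        return self.mode == MODE_FREE and self.host_id == bytes(16)

    def to_sector(self) -> bytes:
        """Pack the fields and zero-pad to exactly one sector."""
        packed = _LAYOUT.pack(self.host_id, self.mode, self.generation,
                              self.hb_region, self.hb_gen)
        return packed.ljust(SECTOR_SIZE, b"\x00")

    @classmethod
    def from_sector(cls, sector: bytes) -> "OnDiskLock":
        """Rebuild a lock record from the first bytes of a sector."""
        return cls(*_LAYOUT.unpack_from(sector))
```

Keeping the record within one sector is what lets a lock be read or rewritten with a single disk IO.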
20. On-disk Heartbeat
A single sector data structure
Every host accessing a VMFS volume acquires
a heartbeat on disk to declare liveness to
other hosts.
Allocated from a 1MB reserved region of the
volume, which allows up to 2048 hosts to access
the volume concurrently.
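The 2048 figure is consistent with one heartbeat per sector if sectors are 512 bytes, which is an assumption here rather than something the slide states. A small sketch of the arithmetic, with the hypothetical helper hb_slot_offset:

```python
HB_REGION_BYTES = 1024 * 1024  # 1 MB reserved region (from the slide)
SECTOR_BYTES = 512             # assumed sector size

# One single-sector heartbeat per host.
max_hosts = HB_REGION_BYTES // SECTOR_BYTES


def hb_slot_offset(slot: int, region_start: int = 0) -> int:
    """Byte offset of a heartbeat slot inside the reserved region."""
    if not 0 <= slot < max_hosts:
        raise ValueError("heartbeat slot out of range")
    return region_start + slot * SECTOR_BYTES
```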
21. HB Failure Handling
Hosts are free to break a lock if the owner's heartbeat
timestamp does not change for 20 seconds. A host should
replay the journal when taking over a stale lock.
If a host fails to update its heartbeat timestamp within five HB
periods (about 15 sec and 40 HB IO tries), it will
fence itself and abort all in-flight IOs.
The lock manager tries to rejoin the cluster and reclaim
its HB slot if the IO error is not permanent.
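The two timeouts above can be sketched as a pair of checks. The 20-second and five-period thresholds come from the slide; the class HeartbeatObserver, the function must_self_fence, and the idea of tracking when a timestamp change was last observed are illustrative assumptions.

```python
from dataclasses import dataclass

LOCK_BREAK_TIMEOUT_S = 20.0  # from the slide
MAX_MISSED_HB_PERIODS = 5    # from the slide


@dataclass
class HeartbeatObserver:
    """Tracks one remote host's heartbeat as seen by the local host."""
    last_timestamp: int
    last_change_seen_at: float  # local clock, in seconds

    def observe(self, timestamp: int, now: float) -> None:
        # Record the local time only when the on-disk timestamp moved.
        if timestamp != self.last_timestamp:
            self.last_timestamp = timestamp
            self.last_change_seen_at = now

    def may_break_locks(self, now: float) -> bool:
        # The remote heartbeat has not changed for the full timeout.
        return now - self.last_change_seen_at >= LOCK_BREAK_TIMEOUT_S


def must_self_fence(missed_periods: int) -> bool:
    """A host that missed this many of its own updates fences itself."""
    return missed_periods >= MAX_MISSED_HB_PERIODS
```

Note that the self-fence window (about 15 seconds) is shorter than the lock-break window (20 seconds), so a disconnected host stops issuing IO before others are allowed to take its locks.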
22. On-disk Lock & HB
Each host can join a cluster by acquiring an on-
disk HB.
It can also hold thousands of on-disk locks.
25. Optimistic Locking
All hosts in a VMFS cluster generally operate on
mutually exclusive subsets of locks on the volume.
A host that is interested in acquiring a given lock will
typically find it to be free on disk.
Instead of acquiring all locks up front, a host first reads
all of them; if they are free, it modifies metadata in memory,
then upgrades the locks and commits.
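A minimal sketch of this optimistic flow, using a plain dictionary in place of the on-disk locks. The function optimistic_transaction, its callbacks, and the use of None for a free lock are hypothetical; a real implementation would also discard the in-memory changes when the upgrade fails.

```python
from typing import Callable, Dict, Iterable, Optional

FREE = None  # hypothetical marker for a lock with no owner


def optimistic_transaction(
    disk_locks: Dict[str, Optional[str]],
    needed: Iterable[str],
    host: str,
    modify_in_memory: Callable[[], None],
    commit: Callable[[], None],
) -> bool:
    """Return True if the transaction committed on the optimistic path."""
    names = list(needed)

    # Read phase: look at the locks without taking them.
    if any(disk_locks[name] is not FREE for name in names):
        return False  # caller falls back to the non-optimistic path

    modify_in_memory()

    # Upgrade phase: re-check, then take every lock before committing.
    if any(disk_locks[name] is not FREE for name in names):
        return False  # caller must discard its in-memory changes
    for name in names:
        disk_locks[name] = host

    commit()
    return True
```

The bet is that contention is rare, so the common case avoids holding locks while metadata is being prepared.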
27. Transaction State Machine w/ op lock
Upgrade Lock
1: reserve disk;
2: issue asynchronous (async) reads of all
required locks;
3: if any lock is held by a remote host,
abort and fall back to normal TSM;
4: issue async writes of all required locks;
5: wait for all async writes to complete;
6: release disk;
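The six steps can be sketched against an in-memory stand-in for the shared disk. SimulatedDisk and upgrade_locks are hypothetical names, the reservation is modeled with a threading.Lock rather than a SCSI reservation, and the reads and writes run sequentially here instead of asynchronously.

```python
import threading
from typing import Dict, List, Optional


class SimulatedDisk:
    """In-memory stand-in for a shared LUN; not a real SCSI interface."""

    def __init__(self, lock_names: List[str]) -> None:
        self.owners: Dict[str, Optional[str]] = {n: None for n in lock_names}
        self._reservation = threading.Lock()

    def reserve(self) -> None:
        self._reservation.acquire()

    def release(self) -> None:
        self._reservation.release()

    def read_lock(self, name: str) -> Optional[str]:
        return self.owners[name]

    def write_lock(self, name: str, owner: str) -> None:
        self.owners[name] = owner


def upgrade_locks(disk: SimulatedDisk, host: str, needed: List[str]) -> bool:
    """Return True if every needed lock is now owned by `host`."""
    disk.reserve()                                       # step 1
    try:
        owners = [disk.read_lock(n) for n in needed]     # step 2
        if any(o not in (None, host) for o in owners):   # step 3
            return False  # caller falls back to the normal TSM
        for n in needed:                                 # steps 4 and 5
            disk.write_lock(n, host)
        return True
    finally:
        disk.release()                                   # step 6
```

Releasing the disk in a finally clause keeps the reservation from leaking when the upgrade aborts at step 3.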
34. Directive SCSI CMD
atomic_test_and_set(block_number, old_image,
new_image)
New lock algorithm for the VMFS lock manager: read a
lock image from disk and, if the lock is free, issue
an atomic_test_and_set with a new_image
containing the host-specific hostID, generation and
heartbeat information.
4 IOs -> 2 IOs
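A sketch of the two-IO acquisition path, with an in-memory Disk class standing in for the storage array. Only the atomic_test_and_set(block_number, old_image, new_image) signature comes from the slide; the Disk class, the acquire function, the IO counter, and the use of an empty byte string for a free lock image are assumptions made for illustration.

```python
from typing import Dict

FREE_IMAGE = b""  # hypothetical image of a free lock


class Disk:
    """In-memory model of a disk that supports a compare-and-write."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}
        self.io_count = 0

    def read(self, block_number: int) -> bytes:
        self.io_count += 1
        return self.blocks.get(block_number, FREE_IMAGE)

    def atomic_test_and_set(self, block_number: int,
                            old_image: bytes, new_image: bytes) -> bool:
        """Write new_image only if the block still equals old_image."""
        self.io_count += 1
        if self.blocks.get(block_number, FREE_IMAGE) != old_image:
            return False
        self.blocks[block_number] = new_image
        return True


def acquire(disk: Disk, block_number: int, new_image: bytes) -> bool:
    current = disk.read(block_number)          # IO 1: read the lock image
    if current != FREE_IMAGE:
        return False                           # lock is not free
    # IO 2: succeeds only if nobody changed the lock since the read.
    return disk.atomic_test_and_set(block_number, current, new_image)
```

Compared with the reserve, read, write, release sequence, this path needs no disk-wide reservation, so other hosts can keep working on unrelated locks while the acquisition is in flight.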