© 2017 PORTWORX | LAYER CLONING FILESYSTEM
LCFS
Storage Driver For Docker
Jobi
Feb 10, 2017
Exec Summary
Every time you build, pull or destroy a Docker container, you are using a storage driver.
Because LCFS is designed only for containers, it is up to 2.5x faster to build an image and up to almost 2x faster to pull an image.
We're looking forward to working with the container community to improve and expand this new tool.
− Open sourced (Apache 2.0)
− Use or contribute!
https://github.com/portworx/lcfs
What is LCFS?
Layers are first-class citizens
− Atomicity is guaranteed per layer, not per system call
Provides
− An efficient snapshotting/cloning mechanism
− Correctness guarantees to containers
A POSIX file system in user space (FUSE), written in C
− No kernel modifications or license issues
No configuration required
Image source: Docker Docs
What is a Graphdriver?
Docker image and container data repository
− And corresponding configuration data
It is a POSIX file system, with some special operations like
− Create read-only layer
− Create read-write layer
− Mount a layer
− Unmount a layer
− Delete a layer
Layers are mostly ephemeral (temporary)
Docker provides ordering of operations
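The special operations listed above can be pictured as a small interface. The following is a minimal sketch in Python, not Docker's actual graphdriver API: the class ToyGraphDriver, its method names and the "/mnt/..." mount points are hypothetical and only mirror the five operations on this slide.

```python
class ToyGraphDriver:
    """In-memory stand-in for a layer store (hypothetical, for illustration)."""

    def __init__(self):
        self.layers = {}      # layer id -> {"parent": id or None, "rw": bool}
        self.mounted = set()  # ids of currently mounted layers

    def create(self, layer_id, parent=None, read_write=False):
        """Create a read-only layer, or a read-write one if read_write is set."""
        if layer_id in self.layers:
            raise ValueError("layer exists: " + layer_id)
        if parent is not None and parent not in self.layers:
            raise KeyError("unknown parent: " + parent)
        self.layers[layer_id] = {"parent": parent, "rw": read_write}

    def mount(self, layer_id):
        if layer_id not in self.layers:
            raise KeyError(layer_id)
        self.mounted.add(layer_id)
        return "/mnt/" + layer_id  # made-up mount point

    def unmount(self, layer_id):
        self.mounted.discard(layer_id)

    def delete(self, layer_id):
        if layer_id in self.mounted:
            raise RuntimeError("layer is mounted: " + layer_id)
        del self.layers[layer_id]


driver = ToyGraphDriver()
driver.create("base")                                # read-only image layer
driver.create("c1", parent="base", read_write=True)  # container layer
path = driver.mount("c1")
driver.unmount("c1")
driver.delete("c1")   # the caller (Docker) decides the order of operations
driver.delete("base")
```

The sketch leaves ordering to the caller, in line with the note that Docker provides ordering of operations.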
Existing solutions
Union file systems vs. Snapshot based
Merged solutions (duplicated effort)
− AUFS on top of Ext4/XFS
− Overlay on top of Ext4/XFS
− Devicemapper on top of LVM/Ext4/XFS
Traditional solutions are optimized for file/block storage, persistent
data, point-in-time snapshots and clones, and all kinds of workflows
(mostly data constantly being modified)
− Not very efficient for storing ephemeral and mostly read-only layers
LCFS Architecture
[Diagram: components shown are the Docker daemon; containers (Container 1, ...) on a boot device issuing init and read/write requests; FUSE in kernel; the FUSE library; LCFS (user mode, purpose built, native); Fedora image layers and MySQL image layers; the device.]
Layers
Root Layer – docker configuration data & volumes
Base layer and read-only layers
Read-write layers (2 per container)
Data shared between layers in a tree
Layers track space allocated to data created in a layer
Each layer has an inode table
Strictly read-only once a layer is created on top
Thin provisioned and branch-on-write
How are layers different?
Layers can be created/deleted without pausing any running containers
− Cloning read-only layers is a lot simpler
Data access time is constant for a container irrespective of the number of containers of an image
− Different from point-in-time snapshots/clones, no roll back
Layers are deleted in the reverse order of creation
− Layers are not deleted in the beginning/middle of a chain
No reference counting of blocks
− Creation/Deletion time independent of size of device, size of data set and
number of layers
− Unlimited number of layers
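Two of the rules above (a layer becomes strictly read-only once a layer is created on top, and layers are never deleted from the beginning or middle of a chain) can be sketched as follows. This is an illustrative Python model, not LCFS code; the Layer class and its fields are hypothetical.

```python
class Layer:
    """Toy layer that freezes when a child is created on top of it."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.frozen = False
        self.files = {}
        if parent is not None:
            parent.frozen = True          # parent is now strictly read-only
            parent.children.append(self)

    def write(self, path, data):
        if self.frozen:
            raise PermissionError(self.name + " is read-only")
        self.files[path] = data

    def delete(self):
        # Only the newest end of a chain may go: no layers may exist on top.
        if self.children:
            raise RuntimeError("layers exist on top of " + self.name)
        if self.parent is not None:
            self.parent.children.remove(self)


base = Layer("base")
base.write("/etc/os-release", b"fedora")
c1 = Layer("c1", parent=base)
c2 = Layer("c2", parent=base)

try:
    base.write("/new", b"")
    base_writable = True
except PermissionError:
    base_writable = False

try:
    base.delete()
    base_deleted = True
except RuntimeError:
    base_deleted = False

# Deleting in reverse order of creation always succeeds.
c2.delete()
c1.delete()
base.delete()
```

Because a deleted layer never has dependants, nothing needs to be reference counted in this model, which is the property the slide attributes to LCFS.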
Layout
Unit of allocation is 4KB
Each layer has a superblock
Superblocks are linked together to recreate the tree of layers on
remount
Root layer superblock tracks blocks where free space information is
tracked
Each layer tracks blocks where allocated space is tracked for the layer
Each layer tracks blocks where inodes are stored
Metadata blocks are checksummed
Space Management
Space is tracked using Extents (start block + count of blocks)
Free Extent Map of the whole file system
Allocated Extent Map for each layer
Each layer makes reservations in large chunks and allocates from those chunks
− Less locking of the global free list
− Better contiguity within a layer (separate chunks for user data, metadata
and inodes)
A minimum device size is required, as is a minimum amount of free space for writes and layer creation
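The extent bookkeeping described on this slide can be sketched as below: a global free extent map, plus a per-layer allocator that reserves a large chunk and then hands out blocks from it without touching the global map. This is a simplified Python illustration, not the LCFS implementation; the class names, the first-fit policy and the chunk size of 256 blocks are assumptions.

```python
class FreeExtentMap:
    """Global free space as a list of (start_block, block_count) extents."""

    def __init__(self, total_blocks):
        self.extents = [(0, total_blocks)]

    def reserve(self, count):
        """First fit: carve `count` contiguous blocks out of the free map."""
        for i, (start, length) in enumerate(self.extents):
            if length >= count:
                if length == count:
                    del self.extents[i]
                else:
                    self.extents[i] = (start + count, length - count)
                return (start, count)
        raise MemoryError("no free extent of %d blocks" % count)


class LayerAllocator:
    CHUNK = 256  # blocks reserved per trip to the global map (assumed value)

    def __init__(self, free_map):
        self.free_map = free_map
        self.chunk = None      # (next_block, blocks_left) in the current chunk
        self.allocated = []    # this layer's allocated extent map

    def alloc(self, count):
        if self.chunk is None or self.chunk[1] < count:
            # The leftover of an exhausted chunk is simply dropped here;
            # a real allocator would return it to the free map.
            self.chunk = self.free_map.reserve(max(count, self.CHUNK))
        start, left = self.chunk
        self.chunk = (start + count, left - count)
        # Extend the previous extent when the new blocks are contiguous.
        if self.allocated and sum(self.allocated[-1]) == start:
            prev_start, prev_count = self.allocated[-1]
            self.allocated[-1] = (prev_start, prev_count + count)
        else:
            self.allocated.append((start, count))
        return start


free_map = FreeExtentMap(1024)
layer_a = LayerAllocator(free_map)
layer_b = LayerAllocator(free_map)
a1 = layer_a.alloc(4)   # reserves blocks 0..255 for layer A
b1 = layer_b.alloc(8)   # reserves blocks 256..511 for layer B
a2 = layer_a.alloc(2)   # served from A's chunk, no global map access
```

In this model each layer's blocks stay contiguous even when two layers allocate alternately, and the global map is touched once per chunk rather than once per allocation.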
Inodes
Each inode takes 128 bytes on disk
− Symbolic links are stored along with inode and inode consumes 4KB
− Access/Creation times not tracked
− Inode number is stored within the inode
Directory blocks are reachable from directory inodes
User data of single extent files reachable directly from the inode
Emap of fragmented files reachable from inode
The same is the case for blocks tracking extended attributes
File Handles
Formed using layer index + inode number
Layer index is unique for a layer, ranging from 0 to 64K
Inode number is unique globally
− inode numbers are shared between layers in a tree for shared files
Inode numbers are never reused
Creates duplicate copies of shared data in kernel page cache, but
those are invalidated as soon as file is closed
− May work better if FUSE is smarter here
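One way to picture a handle formed from layer index + inode number is a packed integer. The slides do not give the actual encoding, so the bit widths below are assumptions: 16 bits for the layer index (enough for the 0-64K range) and 48 bits for the inode number. The function names are hypothetical.

```python
LAYER_BITS = 16                     # assumption: the 0-64K range fits in 16 bits
INODE_BITS = 48                     # assumption: rest of a 64-bit handle
INODE_MASK = (1 << INODE_BITS) - 1


def make_handle(layer_index, inode):
    """Pack a layer index and an inode number into one integer handle."""
    if not 0 <= layer_index < (1 << LAYER_BITS):
        raise ValueError("layer index out of range")
    if not 0 <= inode <= INODE_MASK:
        raise ValueError("inode number out of range")
    return (layer_index << INODE_BITS) | inode


def split_handle(handle):
    """Recover (layer_index, inode) from a handle."""
    return handle >> INODE_BITS, handle & INODE_MASK


# The same shared inode seen from two different layers.
h1 = make_handle(7, 8130)
h2 = make_handle(8, 8130)
```

The two handles differ even though the inode number is shared, which fits the slide's remark that shared data ends up duplicated in the kernel page cache.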
Directory Tree
Global root of the file system with inode number 2
There is another directory, the Layer Root Directory, created for Docker to hold the root directories of all layers
− This directory cannot be deleted, and many operations on it are not allowed
Atomic rename(2) is supported
No need to keep “whiteouts” for removed files as directories are
COWed
Locking
Each layer has a read-write lock, taken by all operations in shared
mode
A layer is locked exclusive while deleting it
Root layer is locked in shared mode while creating/deleting layers
Root layer is locked exclusive while unmounting the file system
File Operations
Each inode has a read-write lock, taken in shared mode by read-only
operations and exclusive mode by modify operations – this lock is not
taken on frozen layers
Writes are acknowledged immediately after copying data to dirty page
cache of the file
fsync(2) is disabled
rmdir(2) in root layer succeeds even when directory is not empty
getxattr()/removexattr() fail immediately, without looking up the inode, when the file system does not have any extended attributes
ioctl(2) support on layer root directory for creating / mounting / unmounting / deleting layers
Branch-On-Write (BOW - COW – Copy UP)
Inode is copied up on modification along with metadata like extended
attributes and directory entries or block map
− Shared metadata may be shared in cache even after copy up
User data blocks are BOWed on modification in 4KB units
− Most applications truncate the whole file and rewrite file with new data
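Branching user data in 4KB units can be sketched as follows: a file in a child layer keeps reading the parent's pages until a page is first written, at which point only that page is copied. This is an illustrative Python model; the BowFile class and its dictionaries are hypothetical and stand in for the real block maps.

```python
PAGE = 4096


class BowFile:
    """A file in a child layer sharing unmodified pages with its parent."""

    def __init__(self, parent_pages):
        self.parent_pages = parent_pages  # page index -> bytes, never modified
        self.own_pages = {}               # pages branched in this layer

    def write(self, offset, data):
        pos = 0
        while pos < len(data):
            index, start = divmod(offset + pos, PAGE)
            n = min(PAGE - start, len(data) - pos)
            if index not in self.own_pages:
                # Branch on first write: copy the parent's page, or start
                # from zeros if the parent has no such page.
                self.own_pages[index] = bytearray(
                    self.parent_pages.get(index, bytes(PAGE)))
            self.own_pages[index][start:start + n] = data[pos:pos + n]
            pos += n

    def read_page(self, index):
        if index in self.own_pages:
            return bytes(self.own_pages[index])
        return self.parent_pages.get(index, bytes(PAGE))


parent = {0: b"a" * PAGE, 1: b"b" * PAGE, 2: b"c" * PAGE}
f = BowFile(parent)
f.write(PAGE + 10, b"hello")   # touches page 1 only
```

After the write, only page 1 exists in the child; pages 0 and 2 are still served from the parent, and the parent's copy of page 1 is unchanged.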
Caching
All metadata stays in memory
− Inodes, directories, emaps, extended attributes, space extent maps,
symbolic links etc.
− Caching actual amount of metadata, not page aligned metadata
Each layer has a hash table for inodes
− Lookups may traverse the parent chain
Inodes have a dirty page list
Layers track hardlinks
Mostly using sequential lists (hashing scheme for large
directories and dirty page list)
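A per-layer inode hash table whose lookups may traverse the parent chain can be sketched like this. It is a simplified Python model, not LCFS code: LayerCache, lookup_for_write and the dictionary-based inodes are hypothetical, and a Python dict stands in for the hash table.

```python
class LayerCache:
    """Per-layer inode table; misses fall through to the parent layer."""

    def __init__(self, parent=None):
        self.parent = parent
        self.icache = {}   # inode number -> inode (the dict does the hashing)

    def lookup(self, ino):
        """Return (inode, owning_layer), walking up the parent chain."""
        layer = self
        while layer is not None:
            inode = layer.icache.get(ino)
            if inode is not None:
                return inode, layer
            layer = layer.parent
        return None, None

    def lookup_for_write(self, ino):
        """Return a private inode, copying it up from a parent if needed."""
        inode, owner = self.lookup(ino)
        if inode is None:
            raise FileNotFoundError(ino)
        if owner is not self:
            inode = dict(inode)       # copy up into this layer's own table
            self.icache[ino] = inode
        return inode


base = LayerCache()
base.icache[100] = {"size": 10}
child = LayerCache(parent=base)

_, owner_before = child.lookup(100)      # found in the parent
child.lookup_for_write(100)["size"] = 20
_, owner_after = child.lookup(100)       # now found in the child
```

In this model the child's private table holds exactly the inodes instantiated in that layer, while the parent's entries stay untouched.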
Page Block Cache
File system blocks are cached in a private page cache, indexed by block number, for shared data
− Data that is not shared still uses the kernel page cache
Each base image maintains a page cache, shared by all layers in the tree that have the same base image
The cache is shared by both user data and metadata
Data Placement
Space allocated to files at the time of sync, not when written
− Size of file known at the time of sync and never changes in a read-only
layer
− Most files can be placed contiguously on disk
− Temporary files and layers may not be written to disk
Small files and metadata are coalesced together as well
Zero blocks written do not consume space
Less metadata, less memory, fewer I/Os
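Deferring allocation to sync time and skipping zero blocks can be sketched as below. This is an illustrative Python function under simple assumptions (one block per 4KB page, blocks handed out sequentially); place_on_sync and its parameters are hypothetical names.

```python
PAGE = 4096
ZERO = bytes(PAGE)


def place_on_sync(dirty_pages, next_free_block):
    """Assign disk blocks to a file's dirty pages at sync time.

    dirty_pages maps page index -> bytes of length PAGE.
    Returns (block_map, next_free_block). All-zero pages get no block.
    """
    block_map = {}
    for index in sorted(dirty_pages):
        if dirty_pages[index] == ZERO:
            continue                    # hole: no space consumed
        block_map[index] = next_free_block
        next_free_block += 1            # consecutive blocks, so contiguous
    return block_map, next_free_block


pages = {0: b"x" * PAGE, 1: ZERO, 2: b"y" * PAGE}
block_map, next_block = place_on_sync(pages, 500)
```

Since the full set of dirty pages is known when the function runs, the non-zero pages land in consecutive blocks, and the zero page in the middle takes no block at all.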
Layer Diff
Needed for docker commit/build operations to find paths modified in a
layer compared to parent layer
Uses custom diff driver – Not NaiveDiffDriver
− Except pre-existing layers after remount
Plugin invokes getxattr calls to get diff for a layer from LCFS
LCFS traverses the private icache of the layer and reports inodes instantiated in the layer
Only for generating diff from the parent layer
Crash Consistency
Docker database of images and containers needs to stay consistent even after an abnormal shutdown of the graphdriver
Considering a checkpointing scheme over a journaling scheme
− Note fsync is disabled
Stats
Every operation in every layer is counted and total, maximum and
minimum time for each type of operation is tracked
This information can be presented in a tabular form on a per layer
basis on demand, periodically or at the time a layer is unmounted
Stats for a container can be restarted before running an application
for proper tracing
Memory usage tracked for each layer
Count of different file types in every layer is tracked
CPU profiling can be enabled with gperftools
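The per-operation counters described above (count, failures, and total, maximum and minimum time per request type) can be sketched as follows. This is a minimal Python illustration, not the LCFS stats code; the OpStats class and its report layout are hypothetical.

```python
class OpStats:
    """Per-layer counters: one row per request type."""

    def __init__(self):
        # op name -> [count, failed, total_seconds, max_seconds, min_seconds]
        self.table = {}

    def record(self, op, seconds, failed=False):
        row = self.table.setdefault(op, [0, 0, 0.0, 0.0, float("inf")])
        row[0] += 1
        row[1] += 1 if failed else 0
        row[2] += seconds
        row[3] = max(row[3], seconds)
        row[4] = min(row[4], seconds)

    def reset(self):
        """Restart stats, e.g. before running an application to trace."""
        self.table.clear()

    def report(self):
        lines = ["%-10s %6s %6s %10s %10s %10s"
                 % ("Request:", "Total", "Failed", "Average", "Max", "Min")]
        for op in sorted(self.table):
            count, failed, total, longest, shortest = self.table[op]
            lines.append("%-10s %6d %6d %10.6f %10.6f %10.6f"
                         % (op + ":", count, failed, total / count,
                            longest, shortest))
        return "\n".join(lines)


stats = OpStats()
stats.record("LOOKUP", 0.000010)
stats.record("LOOKUP", 0.000054, failed=True)
stats.record("READ", 0.000068)
text = stats.report()
```

The average is derived from the stored total and count at report time, so only four numbers plus the failure count need to be kept per request type.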
Container stats
Running a dd command in an ubuntu/bash container - dd if=/dev/zero of=file count=10000 bs=4096
Stats for file system 0x1878680 with root 8130 index 7 at Thu Dec 8 09:26:30 2016
Layer created at Thu Dec 8 09:25:11 2016
Last acccessed at Thu Dec 8 09:26:14 2016
Request:     Total  Failed  Average     Max         Min
LOOKUP:        110      34  0s.000010u  0s.000054u  0s.000003u
GETATTR:        36       0  0s.000005u  0s.000018u  0s.000003u
READLINK:       22       0  0s.000006u  0s.000023u  0s.000004u
OPEN:           43       0  0s.000005u  0s.000013u  0s.000003u
READ:          191       0  0s.000068u  0s.000266u  0s.000004u
FLUSH:           2       0  0s.000000u  0s.000000u  0s.000000u
RELEASE:        35       0  0s.000039u  0s.000430u  0s.000003u
OPENDIR:         1       0  0s.000007u  0s.000007u  0s.000007u
RELEASEDIR:      1       0  0s.000007u  0s.000007u  0s.000007u
CREATE:          1       0  0s.000011u  0s.000011u  0s.000011u
WRITE_BUF:   10000       0  0s.000008u  0s.000120u  0s.000003u
blocks allocated 1 freed 0
2 inodes 10000 pages
0 reads 0 writes (0 inodes written)
Container Memory stats
Running a dd command in an ubuntu/bash container - dd if=/dev/zero of=file count=10000 bs=4096
Memory Stats for file system 0x1435a00 with root 8130 index 7 at Fri Dec 9 06:15:15 2016
DIRENT Allocated 21 Freed 0
ICACHE Allocated 1 Freed 0
INODE Allocated 2 Freed 0
EXTENT Allocated 1 Freed 0
BLOCK Allocated 1 Freed 0
DATA Allocated 10000 Freed 0
DPAGEHASH Allocated 14 Freed 13
STATS Allocated 1 Freed 0
Total memory in use 41213339 bytes
Time to Pull/Delete 30 popular images
[Bar chart: time (y-axis 0 to 800, units not given) for Serial Pull, Parallel Pull, Serial Delete and Parallel Delete. Series: Devmapper, btrfs, Overlay, Overlay2, Lcfs.]
Time to Pull/Delete 30 popular images
[Bar chart: time (y-axis 0 to 500, units not given) for Serial Pull, Parallel Pull, Serial Delete and Parallel Delete. Series: AUFS, Lcfs.]
Time to Pull individual images
[Bar chart: pull time per image (y-axis 0 to 140, units not given). Series: Overlay, Overlay2, Lcfs. Images: php-zendserver, gcc, hectcastro/riak, jenkins, wordpress, kibana, rails, node, rabbitmq, fedora/apache, logstash, elasticsearch, golang, tomcat, sysdig/sysdig, django, cassandra, mongo, postgres, mysql, mariadb, maven, redis, php, httpd, haproxy, nginx, memcached, gliderlabs/logspout, java.]
Time to Spawn fedora/apache Containers
[Chart: time (y-axis 0 to 180, units not given) against 20, 40, 60, 80 and 100 containers. Series: Devicemapper, btrfs, Overlay, Overlay2, Lcfs.]
Time to Spawn fedora/apache Containers
[Chart: time (y-axis 0 to 60, units not given) against 20, 40, 60, 80 and 100 containers. Series: AUFS, Lcfs.]
Time to Remove fedora/apache Containers
[Chart: time (y-axis 0 to 70, units not given) against 20, 40, 60, 80 and 100 containers. Series: Devmapper, btrfs, Overlay, Overlay2, Lcfs.]
Time to Remove fedora/apache Containers
[Chart: time (y-axis 0 to 45, units not given) against 20, 40, 60, 80 and 100 containers. Series: AUFS, Lcfs.]
Time to Build Docker sources
[Bar chart: Docker Build time (y-axis 0 to 1600, units not given). Series: Devmapper, btrfs, Overlay, Overlay2, Lcfs.]
Time to Build Docker sources
[Bar chart: Docker Build time (y-axis 0 to 700, units not given). Series: AUFS, Lcfs.]
IOPS with fiograph
docker run portworx/fiograph --blocksize=1024K --filename=/root/1g.bin --ioengine=libaio --readwrite=read --size=1024M --name=test --gtod_reduce=1 --iodepth=1 --time_based --runtime=60
[Bar chart: IOPS (y-axis 0 to 7000) for the libaio and splice I/O engines. Series: Devmapper, Overlay, Overlay2, Lcfs.]
LCFS - A Docker V2 Graphdriver Plugin
Download & build LCFS or install the RPM
− git clone git@github.com:/portworx/lcfs, cd lcfs/lcfs, make
− rpm -Uvh http://yum.portworx.com/repo/rpms/px-graph/lcfs-0.0.0-0.x86_64.rpm
Mount a device at /var/lib/docker and /lcfs
− ./lcfs <device/file> /var/lib/docker /lcfs -f
Start docker with vfs storage driver (1.13+)
− dockerd -s vfs
Install LCFS plugin
− docker plugin install portworx/lcfs
Restart docker with lcfs graphdriver
− dockerd --experimental -s portworx/lcfs
Pending tasks
Crash consistency
Metadata paging
Replace linear search algorithms
https://github.com/portworx/lcfs/issues
QA
Roadmap
QoS at container level (COS, IOPS, quotas etc.)
Distributed Graphdriver (images shared)
Seamless container migration in a cluster
− Load Balancing
Backup/Restore of Graphdriver
Q&A
More info
− https://docs.docker.com/engine/userguide/storagedriver/imagesandcontainers/
− https://github.com/portworx/lcfs
Thank You!