1© 2017 PORTWORX | LAYER CLONING FILESYSTEM
LCFS
Storage Driver For Docker
Jobi
Feb 10, 2017
Exec Summary
• Every time you build, pull or destroy a Docker container, you are using a storage driver.
• Because LCFS is designed only for containers, it is up to 2.5x faster to build an image and almost 2x faster to pull an image.
• We're looking forward to working with the container community to improve and expand this new tool.
  − Open sourced (Apache 2.0)
  − Use or contribute!
https://github.com/portworx/lcfs
What is LCFS?
• Layers are first-class citizens
  − Atomicity guarantees apply to each layer, not to each system call
• Provides
  − An efficient snapshotting/cloning mechanism
  − Correctness guarantees to containers
• A POSIX file system in user space (FUSE), written in C
  − No kernel modifications or license issues
• No configuration required
Image source: Docker Docs
What is a Graphdriver?
• Docker image and container data repository
  − And the corresponding configuration data
• It is a POSIX file system with some special operations, such as
  − Create a read-only layer
  − Create a read-write layer
  − Mount a layer
  − Unmount a layer
  − Delete a layer
• Layers are mostly ephemeral (temporary)
• Docker provides ordering of operations
Existing solutions
• Union file systems vs. snapshot-based solutions
• Merged solutions (duplicated effort)
  − AUFS on top of Ext4/XFS
  − Overlay on top of Ext4/XFS
  − Devicemapper on top of LVM/Ext4/XFS
• Traditional solutions are optimized for file/block storage, persistent data, point-in-time snapshots and clones, and all kinds of workflows (mostly data that is constantly being modified)
  − Not very efficient for storing ephemeral and mostly read-only layers
LCFS Architecture
[Architecture diagram. Labeled components: Docker daemon; containers (Container 1, ...), each with init and read/write layers; Fedora image layers and MySQL image layers; LCFS (user mode, purpose built, native); FUSE library; FUSE in kernel; kernel; device; boot device.]
Layers
• Root layer: Docker configuration data and volumes
• Base layer and read-only layers
• Read-write layers (2 per container)
• Data is shared between layers in a tree
• Each layer tracks the space allocated to data created in that layer
• Each layer has an inode table
• A layer is strictly read-only once another layer is created on top of it
• Thin provisioned and branch-on-write
How are layers different?
• Layers can be created/deleted without pausing any running containers
  − Cloning read-only layers is a lot simpler
• Data access time is constant for a container, irrespective of the number of containers of an image
  − Different from point-in-time snapshots/clones: no rollback
• Layers are deleted in the reverse order of creation
  − Layers are not deleted from the beginning or middle of a chain
• No reference counting of blocks
  − Creation/deletion time is independent of the size of the device, the size of the data set and the number of layers
  − Unlimited number of layers
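The layer rules above (a layer becomes read-only once another layer sits on top of it, and deletion only happens at the newest end of a chain) can be illustrated with a small in-memory model. This is only a sketch: the `Layer` class and its fields are hypothetical and are not taken from the LCFS sources.

```python
class Layer:
    """One node in a hypothetical layer tree (illustration only)."""

    def __init__(self, name, parent=None, read_only=False):
        self.name = name
        self.parent = parent
        self.children = []
        self.read_only = read_only
        if parent is not None:
            # A layer is strictly read-only once a layer is created on top of it.
            parent.read_only = True
            parent.children.append(self)

    def delete(self):
        # Layers are removed in reverse order of creation: a layer that still
        # has layers on top of it sits in the middle of a chain and stays.
        if self.children:
            raise ValueError(f"{self.name} still has layers on top of it")
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None


base = Layer("base", read_only=True)
image = Layer("image", parent=base, read_only=True)
container = Layer("container-rw", parent=image)

try:
    image.delete()          # middle of the chain: refused
    refused = False
except ValueError:
    refused = True

container.delete()          # newest layer first
image.delete()              # now allowed
```

Because nothing is ever removed from the middle of a chain, the model needs no per-block reference counts; a parent pointer and a list of children are enough.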
Layout
• Unit of allocation is 4KB
• Each layer has a superblock
• Superblocks are linked together to recreate the tree of layers on remount
• The root layer superblock points to the blocks where free space information is tracked
• Each layer points to the blocks where its allocated space is tracked
• Each layer points to the blocks where its inodes are stored
• Metadata blocks are checksummed
Space Management
• Space is tracked using extents (start block + count of blocks)
• Free extent map for the whole file system
• Allocated extent map for each layer
• Each layer makes reservations in large chunks and allocates from those chunks
  − Less locking of the global free list
  − Better contiguity within a layer (separate chunks for user data, metadata and inodes)
• Minimum size for a device; minimum free space for writes and layer creation
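A minimal sketch of the extent scheme described above, assuming a first-fit global free list and a single reservation chunk per layer (the slides mention separate chunks for user data, metadata and inodes). The class names and the 256-block chunk size are made up for illustration.

```python
class FreeExtentMap:
    """Global free space as a sorted list of (start_block, block_count) extents."""

    def __init__(self, total_blocks):
        self.extents = [(0, total_blocks)]

    def reserve(self, count):
        """Carve `count` contiguous blocks out of the first extent that fits."""
        for i, (start, length) in enumerate(self.extents):
            if length >= count:
                if length == count:
                    del self.extents[i]
                else:
                    self.extents[i] = (start + count, length - count)
                return start, count
        raise MemoryError("no free extent large enough")

    def release(self, start, count):
        """Give an extent back (this sketch does not merge neighbours)."""
        self.extents.append((start, count))
        self.extents.sort()


class LayerAllocator:
    CHUNK_BLOCKS = 256  # hypothetical reservation size

    def __init__(self, free_map):
        self.free_map = free_map
        self.chunk = (0, 0)      # (next free block in chunk, blocks left)
        self.allocated = []      # this layer's allocated extent map

    def allocate(self, count):
        next_block, left = self.chunk
        if left < count:
            # Only here does the layer touch the global free list.
            if left:
                self.free_map.release(next_block, left)
            next_block, left = self.free_map.reserve(max(self.CHUNK_BLOCKS, count))
        self.chunk = (next_block + count, left - count)
        # Extend the last allocated extent when the new blocks are adjacent.
        if self.allocated and sum(self.allocated[-1]) == next_block:
            start, length = self.allocated[-1]
            self.allocated[-1] = (start, length + count)
        else:
            self.allocated.append((next_block, count))
        return next_block


free_map = FreeExtentMap(total_blocks=1024)
layer_a = LayerAllocator(free_map)
layer_b = LayerAllocator(free_map)
a1 = layer_a.allocate(3)   # layer A reserves a chunk, then uses its first 3 blocks
b1 = layer_b.allocate(2)   # layer B gets a different chunk
a2 = layer_a.allocate(5)   # served from A's chunk, contiguous with a1
```

The two layers interleave their allocations, yet each layer's blocks stay contiguous and the global list is consulted once per chunk rather than once per allocation.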
Inodes
• Each inode takes 128 bytes on disk
  − Symbolic links are stored along with the inode, and the inode consumes 4KB
  − Access/creation times are not tracked
  − Inode number is stored within the inode
• Directory blocks are reachable from directory inodes
• User data of single-extent files is reachable directly from the inode
• Emap of fragmented files is reachable from the inode
• The same is the case with blocks tracking extended attributes
File Handles
• Formed using layer index + inode number
• Layer index is unique for a layer, in the range 0-64K
• Inode number is unique globally
  − Inode numbers are shared between layers in a tree for shared files
• Inode numbers are never reused
• Creates duplicate copies of shared data in the kernel page cache, but those are invalidated as soon as the file is closed
  − May work better if FUSE is smarter here
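One way to picture a handle built from a layer index and an inode number is to pack both into a single 64-bit value. The bit split below is an assumption for illustration ("0-64K" is read as a 16-bit index and the remaining 48 bits are given to the inode number); the slides do not state the actual encoding.

```python
LAYER_BITS = 16                      # assumption: "0-64K" read as 0..65535
INODE_BITS = 48                      # assumption: the rest of a 64-bit handle
INODE_MASK = (1 << INODE_BITS) - 1


def make_handle(layer_index, inode_number):
    """Combine a layer index and an inode number into one integer handle."""
    if not 0 <= layer_index < (1 << LAYER_BITS):
        raise ValueError("layer index out of range")
    if not 0 <= inode_number <= INODE_MASK:
        raise ValueError("inode number out of range")
    return (layer_index << INODE_BITS) | inode_number


def split_handle(handle):
    """Recover (layer_index, inode_number) from a handle."""
    return handle >> INODE_BITS, handle & INODE_MASK


# The same shared file (inode 8130) seen through two layers of one tree.
handle_in_parent = make_handle(3, 8130)
handle_in_child = make_handle(7, 8130)
```

The two handles differ even though the inode number is the same, which fits the note above that shared data can end up cached more than once in the kernel page cache.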
Directory Tree
• Global root of the file system has inode number 2
• There is another directory, called the Layer Root Directory, created for Docker to place the root directories of all layers
  − This directory cannot be deleted, and many operations are not allowed on it
• Atomic rename(2) is supported
• No need to keep "whiteouts" for removed files, as directories are COWed
Locking
• Each layer has a read-write lock, taken in shared mode by all operations
• A layer is locked exclusive while it is being deleted
• Root layer is locked in shared mode while creating/deleting layers
• Root layer is locked exclusive while unmounting the file system
File Operations
• Each inode has a read-write lock, taken in shared mode by read-only operations and in exclusive mode by modifying operations; this lock is not taken on frozen layers
• Writes are acknowledged immediately after copying data to the dirty page cache of the file
• fsync(2) is disabled
• rmdir(2) in the root layer succeeds even when the directory is not empty
• getxattr()/removexattr() fail without looking up the inode when the file system does not have any extended attributes
• ioctl(2) support on the layer root directory for creating / mounting / unmounting / deleting layers
Branch-On-Write (BOW, also called COW or copy-up)
• Inode is copied up on modification, along with metadata such as extended attributes and directory entries or block map
  − Shared metadata may stay shared in cache even after copy-up
• User data blocks are BOWed on modification in 4KB units
  − Most applications truncate the whole file and rewrite it with new data
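The copy-up behaviour can be sketched as follows. On the first write in a child layer, the inode's block map is copied into that layer, but the 4KB data blocks themselves stay shared until each one is modified. The `Layer` class, the dictionary-based block map and the method names are hypothetical simplifications.

```python
BLOCK_SIZE = 4096


class Layer:
    def __init__(self, parent=None):
        self.parent = parent
        self.inodes = {}   # inode number -> {block index: 4KB of data}

    def lookup(self, ino):
        """Find an inode in this layer, falling back to the parent chain."""
        layer = self
        while layer is not None:
            if ino in layer.inodes:
                return layer.inodes[ino]
            layer = layer.parent
        raise FileNotFoundError(ino)

    def write_block(self, ino, index, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("this sketch only writes whole blocks")
        if ino not in self.inodes:
            # Copy-up: the layer gets a private block map, but every entry
            # still refers to the parent's data blocks.
            self.inodes[ino] = dict(self.lookup(ino))
        # Branch-on-write: only the modified block gets new data.
        self.inodes[ino][index] = data


base = Layer()
base.inodes[10] = {0: b"a" * BLOCK_SIZE, 1: b"b" * BLOCK_SIZE}
child = Layer(parent=base)

shared_before_write = child.lookup(10) is base.inodes[10]
child.write_block(10, 1, b"c" * BLOCK_SIZE)
```

After the write, block 0 is still the very same object in both layers, while block 1 differs only in the child; the parent layer is untouched.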
Caching
• All metadata stays in memory
  − Inodes, directories, emaps, extended attributes, space extent maps, symbolic links, etc.
  − Caches the actual amount of metadata, not page-aligned metadata
• Each layer has a hash table for inodes
  − Lookups may traverse the parent chain
• Inodes have a dirty page list
• Layers track hardlinks
• Mostly uses sequential lists (with a hashing scheme for large directories and the dirty page list)
Page Block Cache
• File system blocks are cached in a private page cache, indexed by block number, for shared data
  − Data that is not shared still uses the kernel page cache
• Each base image maintains a page cache, which is shared by all layers in the tree that have the same base image
• Shared by both user data and metadata
Data Placement
• Space is allocated to files at the time of sync, not when written
  − Size of a file is known at the time of sync and never changes in a read-only layer
  − Most files can be placed contiguously on disk
  − Temporary files and layers may never be written to disk
• Small files and metadata are coalesced together as well
• Zero blocks written do not consume space
• Less metadata, less memory, fewer I/Os
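As a rough illustration of allocating at sync time and skipping zero blocks, the sketch below assigns disk blocks to a file's dirty pages in one pass. The function name, the dictionary layout and the starting block number are assumptions made for this example.

```python
BLOCK_SIZE = 4096
ZERO_BLOCK = bytes(BLOCK_SIZE)


def place_file(dirty_pages, next_free_block):
    """Assign disk blocks to a file's dirty pages when the file is synced.

    dirty_pages maps page index -> 4KB of data. Returns (block_map,
    next_free_block), where block_map maps page index -> disk block and
    contains entries for non-zero pages only.
    """
    block_map = {}
    for index in sorted(dirty_pages):
        if dirty_pages[index] == ZERO_BLOCK:
            continue  # an all-zero page consumes no space
        block_map[index] = next_free_block
        next_free_block += 1
    return block_map, next_free_block


pages = {
    0: b"x" * BLOCK_SIZE,
    1: bytes(BLOCK_SIZE),      # written, but all zeros
    2: b"y" * BLOCK_SIZE,
}
block_map, next_free = place_file(pages, next_free_block=100)
```

Because the whole file is known when placement happens, the non-zero pages land on consecutive blocks (100 and 101 here) and the zero page is simply left out of the map.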
Layer Diff
• Needed for docker commit/build operations to find the paths modified in a layer compared to its parent layer
• Uses a custom diff driver, not NaiveDiffDriver
  − Except for pre-existing layers after remount
• Plugin invokes getxattr calls to get the diff for a layer from LCFS
• LCFS traverses the private icache of the layer and reports the inodes instantiated in the layer
• Only for generating the diff from the parent layer
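The difference between a naive diff and a diff driven by the layer's own inode cache can be sketched like this. For simplicity the private cache is keyed by path instead of inode number and deletions are ignored; both are assumptions of the example, not properties of LCFS.

```python
class Layer:
    def __init__(self, parent=None):
        self.parent = parent
        self.private = {}   # entries instantiated in this layer: path -> content


def full_view(layer):
    """Merged path -> content view of a layer, built parent first."""
    view = full_view(layer.parent) if layer.parent else {}
    view.update(layer.private)
    return view


def naive_diff(layer):
    """Compare the complete views of a layer and its parent."""
    parent_view = full_view(layer.parent) if layer.parent else {}
    view = full_view(layer)
    return sorted(path for path in view if parent_view.get(path) != view[path])


def icache_diff(layer):
    """Report only what was instantiated in the layer itself."""
    return sorted(layer.private)


base = Layer()
base.private = {"/etc/hosts": "v1", "/bin/sh": "elf"}
child = Layer(parent=base)
child.private = {"/etc/hosts": "v2", "/tmp/new": "x"}

changed_naive = naive_diff(child)
changed_icache = icache_diff(child)
```

Both functions report the same paths here, but the second one never looks at the parent: its cost depends on how much the layer changed, not on the size of the image underneath.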
Crash Consistency
• Docker database of images and containers needs to stay consistent even after an abnormal shutdown of the graphdriver
• Considering a checkpointing scheme over a journaling scheme
  − Note that fsync is disabled
Stats
• Every operation in every layer is counted, and the total, maximum and minimum time for each type of operation is tracked
• This information can be presented in tabular form on a per-layer basis on demand, periodically, or at the time a layer is unmounted
• Stats for a container can be restarted before running an application for proper tracing
• Memory usage is tracked for each layer
• Count of different file types in every layer is tracked
• CPU profiling can be enabled with gperftools
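Per-layer counters of this kind might look like the sketch below: one record per request type holding a count, a failure count and total/maximum/minimum times, plus a restart operation. The class names and the report layout are illustrative assumptions and do not reproduce the LCFS implementation.

```python
from collections import defaultdict


class OpStats:
    """Counters for one request type (for example LOOKUP) in one layer."""

    def __init__(self):
        self.count = 0
        self.failed = 0
        self.total = 0.0
        self.max = 0.0
        self.min = float("inf")

    def record(self, elapsed, ok=True):
        self.count += 1
        self.failed += 0 if ok else 1
        self.total += elapsed
        self.max = max(self.max, elapsed)
        self.min = min(self.min, elapsed)

    def average(self):
        return self.total / self.count if self.count else 0.0


class LayerStats:
    def __init__(self):
        self.ops = defaultdict(OpStats)

    def record(self, op, elapsed, ok=True):
        self.ops[op].record(elapsed, ok)

    def restart(self):
        """Drop all counters, e.g. before tracing a single application run."""
        self.ops.clear()

    def report(self):
        lines = ["Request: Total Failed Average Max Min"]
        for op in sorted(self.ops):
            s = self.ops[op]
            lines.append(
                f"{op}: {s.count} {s.failed} "
                f"{s.average():.6f} {s.max:.6f} {s.min:.6f}"
            )
        return "\n".join(lines)


stats = LayerStats()
stats.record("LOOKUP", 0.000010)
stats.record("LOOKUP", 0.000054, ok=False)
stats.record("LOOKUP", 0.000003)
stats.record("READ", 0.000068)
report = stats.report()
```

Keeping the total alongside the count means the average never has to be stored; it is derived when the table is printed.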
Container stats
Running a dd command in an ubuntu/bash container - dd if=/dev/zero of=file count=10000 bs=4096
Stats for file system 0x1878680 with root 8130 index 7 at Thu Dec 8 09:26:30 2016
Layer created at Thu Dec 8 09:25:11 2016
Last acccessed at Thu Dec 8 09:26:14 2016
Request: Total Failed Average Max Min
LOOKUP: 110 34 0s.000010u 0s.000054u 0s.000003u
GETATTR: 36 0 0s.000005u 0s.000018u 0s.000003u
READLINK: 22 0 0s.000006u 0s.000023u 0s.000004u
OPEN: 43 0 0s.000005u 0s.000013u 0s.000003u
READ: 191 0 0s.000068u 0s.000266u 0s.000004u
FLUSH: 2 0 0s.000000u 0s.000000u 0s.000000u
RELEASE: 35 0 0s.000039u 0s.000430u 0s.000003u
OPENDIR: 1 0 0s.000007u 0s.000007u 0s.000007u
RELEASEDIR: 1 0 0s.000007u 0s.000007u 0s.000007u
CREATE: 1 0 0s.000011u 0s.000011u 0s.000011u
WRITE_BUF: 10000 0 0s.000008u 0s.000120u 0s.000003u
blocks allocated 1 freed 0
2 inodes 10000 pages
0 reads 0 writes (0 inodes written)
Container Memory stats
Running a dd command in an ubuntu/bash container - dd if=/dev/zero of=file count=10000 bs=4096
Memory Stats for file system 0x1435a00 with root 8130 index 7 at Fri Dec 9 06:15:15 2016
DIRENT Allocated 21 Freed 0
ICACHE Allocated 1 Freed 0
INODE Allocated 2 Freed 0
EXTENT Allocated 1 Freed 0
BLOCK Allocated 1 Freed 0
DATA Allocated 10000 Freed 0
DPAGEHASH Allocated 14 Freed 13
STATS Allocated 1 Freed 0
Total memory in use 41213339 bytes
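A quick sanity check of the numbers above, assuming that each DATA allocation is one 4096-byte page (this matches bs=4096 in the dd command, but the slide does not say so explicitly):

```python
PAGE_SIZE = 4096             # assumption: one DATA allocation = one 4KB page
data_pages = 10000           # "DATA Allocated 10000"
total_in_use = 41213339      # "Total memory in use 41213339 bytes"

data_bytes = data_pages * PAGE_SIZE
other_bytes = total_in_use - data_bytes

print(f"file data: {data_bytes} bytes")
print(f"everything else: {other_bytes} bytes")
```

Under that assumption, almost all of the memory reported for this layer is the dirty file data itself, with roughly a quarter of a megabyte left for everything else.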
Time to Pull/Delete 30 popular images
[Bar chart, y-axis 0 to 800 (units not shown): Serial Pull, Parallel Pull, Serial Delete, Parallel Delete; series: Devmapper, btrfs, Overlay, Overlay2, LCFS]
[Bar chart, y-axis 0 to 500 (units not shown): Serial Pull, Parallel Pull, Serial Delete, Parallel Delete; series: AUFS, LCFS]
Time to Pull individual images
[Bar chart, y-axis 0 to 140 (units not shown); series: Overlay, Overlay2, LCFS; images: php-zendserver, gcc, hectcastro/riak, jenkins, wordpress, kibana, rails, node, rabbitmq, fedora/apache, logstash, elasticsearch, golang, tomcat, sysdig/sysdig, django, cassandra, mongo, postgres, mysql, mariadb, maven, redis, php, httpd, haproxy, nginx, memcached, gliderlabs/logspout, java]
Time to Spawn fedora/apache Containers
[Chart, y-axis 0 to 180 (units not shown), x-axis 20, 40, 60, 80, 100; series: Devicemapper, btrfs, Overlay, Overlay2, LCFS]
[Chart, y-axis 0 to 60 (units not shown), x-axis 20, 40, 60, 80, 100; series: AUFS, LCFS]
Time to Remove fedora/apache Containers
[Chart, y-axis 0 to 70 (units not shown), x-axis 20, 40, 60, 80, 100; series: Devmapper, btrfs, Overlay, Overlay2, LCFS]
[Chart, y-axis 0 to 45 (units not shown), x-axis 20, 40, 60, 80, 100; series: AUFS, LCFS]
Time to Build Docker sources
[Bar chart, y-axis 0 to 1600 (units not shown): Docker Build; series: Devmapper, btrfs, Overlay, Overlay2, LCFS]
[Bar chart, y-axis 0 to 700 (units not shown): Docker Build; series: AUFS, LCFS]
IOPS with fiograph
docker run portworx/fiograph --blocksize=1024K --filename=/root/1g.bin --ioengine=libaio --readwrite=read --size=1024M --name=test --gtod_reduce=1 --iodepth=1 --time_based --runtime=60
[Bar chart, y-axis 0 to 7000: libaio, splice; series: Devmapper, Overlay, Overlay2, LCFS]
LCFS - A Docker V2 Graphdriver Plugin
• Download and build LCFS, or install the RPM
  − git clone git@github.com:/portworx/lcfs, cd lcfs/lcfs, make
  − rpm -Uvh http://yum.portworx.com/repo/rpms/px-graph/lcfs-0.0.0-0.x86_64.rpm
• Mount a device at /var/lib/docker and /lcfs
  − ./lcfs <device/file> /var/lib/docker /lcfs -f
• Start docker with the vfs storage driver (1.13+)
  − dockerd -s vfs
• Install the LCFS plugin
  − docker plugin install portworx/lcfs
• Restart docker with the lcfs graphdriver
  − dockerd --experimental -s portworx/lcfs
Pending tasks
• Crash consistency
• Metadata paging
• Replace linear search algorithms
• https://github.com/portworx/lcfs/issues
• QA
Roadmap
• QoS at container level (COS, IOPS, quotas, etc.)
• Distributed graphdriver (images shared)
• Seamless container migration in a cluster
  − Load balancing
• Backup/restore of graphdriver
Q&A
• More info
  − https://docs.docker.com/engine/userguide/storagedriver/imagesandcontainers/
  − https://github.com/portworx/lcfs
• Thank You!

LCFS - Storage Driver for Docker

  • 1. 1© 2017 PORTWORX | LAYER CLONING FILESYSTEM LCFS Storage Driver For Docker Jobi FEB10, 2017
  • 2. 2© 2017 PORTWORX | LAYER CLONING FILESYSTEM  Every time you build, pull or destroy a Docker container, you are using a storage driver.  Because it is designed only for containers, it is up to 2.5x faster to build an image and up to almost 2x faster to pull an image.  We're looking forward to working with the container community to improve and expand this new tool. − Open Sourced (Apache 2.0) − Use or Contribute! https://github.com/portworx/lcfs Exec Summary
  • 3. 3© 2017 PORTWORX | LAYER CLONING FILESYSTEM What is LCFS?  Layers are first class citizens − Atomicity guarantees for each layer, not at system call  Provides − Efficient snapshotting/cloning mechanism − correctness guarantees to containers  A Posix File System in User space (FUSE) in C − No kernel modifications or license issues  No configuration required imagesource:DockerDocs
  • 4. 4© 2017 PORTWORX | LAYER CLONING FILESYSTEM What is a Graphdriver?  Docker image and container data repository − And corresponding configuration data  It is a POSIX file system, with some special operations like − Create read-only layer − Create read-write layer − Mount a layer − Unmount a layer − Delete a layer  Layers are mostly ephemeral (temporary)  Docker provides ordering of operations
  • 5. 5© 2017 PORTWORX | LAYER CLONING FILESYSTEM Existing solutions  Union file systems vs. Snapshot based  Merged solutions (duplicated effort) − AUFS on top of Ext4/XFS − Overlay on top of Ext4/XFS − Devicemapper on top of LVM/Ext4/XFS  Traditional solutions are optimized for file/block storage, persistent data, point-in-time snapshots and clones, and all kinds of workflows (mostly data constantly being modified) − Not very efficient for storing ephemeral and mostly read-only layers
  • 6. 6© 2017 PORTWORX | LAYER CLONING FILESYSTEM LCFS Architecture 6 kernel device FUSE Library Fedora image Layers MySQL image Layers Container 1 boot device init read/write LCFS • User mode • Purpose built • Native Docker Daemon FUSE in Kernel init read/write init read/write . . .
  • 7. 7© 2017 PORTWORX | LAYER CLONING FILESYSTEM Layers  Root Layer – docker configuration data & volumes  Base layer and read-only layers  Read-write layers (2 per container)  Data shared between layers in a tree  Layers track space allocated to data created in a layer  Each layer has an inode table  Strictly read-only once a layer is created on top  Thin provisioned and branch-on-write
  • 8. 8© 2017 PORTWORX | LAYER CLONING FILESYSTEM How layers different?  Layers can be created/deleted without pausing any running containers − cloning read-only layers is a lot simple  Data access time is constant for a container irrespective of the number on containers of an image − Different from point-in-time snapshots/clones, no roll back  Layers are deleted in the reverse order of creation − Layers are not deleted in the beginning/middle of a chain  No reference counting of blocks − Creation/Deletion time independent of size of device, size of data set and number of layers − Unlimited number of layers
  • 9. 9© 2017 PORTWORX | LAYER CLONING FILESYSTEM Layout  Unit of allocation is 4KB  Each layer has a super block  Superblocks are linked together to recreate the tree of layers on remount  Root layer superblock tracks blocks where free space information is tracked  Each layer tracks blocks where allocated space is tracked for the layer  Each layer tracks blocks where inodes are stored  Metadata blocks are checksummed
  • 10. 10© 2017 PORTWORX | LAYER CLONING FILESYSTEM Space Management  Space is tracked using Extents (start block + count of blocks)  Free Extent Map of the whole file system  Allocated Extent Map for each layer  Each layer make reservations in large chunks and allocate from those chunks − Less locking of the global free list − Better contiguity within a layer (separate chunks for user data, metadata and inodes)  Minimum size for a device, Minimum free space for writes and layer creation
  • 11. 11© 2017 PORTWORX | LAYER CLONING FILESYSTEM Inodes  Each inode takes 128 bytes on disk − Symbolic links are stored along with inode and inode consumes 4KB − Access/Creation times not tracked − Inode number is stored within the inode  Directory blocks are reachable from directory inodes  User data of single extent files reachable directly from the inode  Emap of fragmented files reachable from inode  Same the case with blocks tracking extended attributes
  • 12. 12© 2017 PORTWORX | LAYER CLONING FILESYSTEM File Handles  Formed using layer index + inode number  Layer index is unique for a layer, range between 0-64K  Inode number is unique globally − inode numbers are shared between layers in a tree for shared files  Inode numbers are never reused  Creates duplicate copies of shared data in kernel page cache, but those are invalidated as soon as file is closed − May work better if FUSE is smarter here
  • 13. 13© 2017 PORTWORX | LAYER CLONING FILESYSTEM Directory Tree  Global root of the file system with inode number 2  There is another directory called Layer Root Directory, created for docker for placing root directory of all layers − This directory cannot be deleted or many operations are not allowed  Atomic rename(2) is supported  No need to keep “whiteouts” for removed files as directories are COWed
  • 14. 14© 2017 PORTWORX | LAYER CLONING FILESYSTEM Locking  Each layer has a read-write lock, taken by all operations in shared mode  A layer is locked exclusive while deleting it  Root layer is locked in shared mode while creating/deleting layers  Root layer is locked exclusive while unmounting the file system
  • 15. 15© 2017 PORTWORX | LAYER CLONING FILESYSTEM File Operations  Each inode has a read-write lock, taken in shared mode by read-only operations and exclusive mode by modify operations – this lock is not taken on frozen layers  Writes are acknowledged immediately after copying data to dirty page cache of the file  fsync(2) is disabled  rmdir(2) in root layer succeeds even when directory is not empty  getxattr()/removexattr() are failed when the file system does not have any extended attributes without looking up the inode  ioctl(2) support on layer root directory for creating/ mounting / unmounting / deleting layers
  • 16. 16© 2017 PORTWORX | LAYER CLONING FILESYSTEM Branch-On-Write (BOW - COW – Copy UP)  Inode is copied up on modification along with metadata like extended attributes and directory entries or block map − Shared metadata may be shared in cache even after copy up  User data blocks are BOWed on modification in 4KB sizes − Most applications truncate the whole file and rewrite file with new data
  • 17. Caching
      • All metadata stays in memory
          − Inodes, directories, emaps, extended attributes, space extent maps, symbolic links, etc.
          − Caches the actual amount of metadata, not page-aligned metadata
      • Each layer has a hash table for inodes
          − Lookups may traverse the parent chain
      • Inodes have a dirty page list
      • Layers track hardlinks
      • Mostly sequential lists (with a hashing scheme for large directories and the dirty page list)
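The per-layer inode table with parent-chain lookup could look roughly like this. `CachedLayer` and `lookup` are hypothetical names, and a Python dict stands in for the hash table.

```python
class CachedLayer:
    def __init__(self, parent=None):
        self.parent = parent
        self.icache = {}      # inode number -> inode object private to this layer

    def lookup(self, ino):
        """Return (owning_layer, inode), searching this layer and then its ancestors."""
        layer = self
        while layer is not None:
            inode = layer.icache.get(ino)
            if inode is not None:
                return layer, inode
            layer = layer.parent
        return None, None
```

An inode that was never modified in a child layer is found in an ancestor's table, so the child does not need its own copy until the inode is copied up.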
  • 18. Page Block Cache
      • File system blocks are cached in a private page cache, indexed by block number, for shared data
          − Data that is not shared still uses the kernel page cache
      • Each base image maintains a page cache, shared by all layers in the tree that have the same base image
      • Shared by both user data and metadata
  • 19. Data Placement
      • Space is allocated to files at the time of sync, not when written
          − The size of a file is known at the time of sync and never changes in a read-only layer
          − Most files can be placed contiguously on disk
          − Temporary files and layers may never be written to disk
      • Small files and metadata are coalesced together as well
      • Zero blocks written do not consume space
      • Less metadata, less memory, fewer I/Os
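A sketch of allocation deferred to sync time, under simplifying assumptions: `Allocator` is a hypothetical bump allocator, `sync_file` is a hypothetical helper, and dirty pages are kept as a plain dict of page index to bytes.

```python
PAGE_SIZE = 4096
ZERO_PAGE = bytes(PAGE_SIZE)


class Allocator:
    """Hands out contiguous runs of block numbers (a simple bump allocator)."""

    def __init__(self):
        self.next_block = 0

    def allocate(self, count):
        start = self.next_block
        self.next_block += count
        return start


def sync_file(dirty_pages, allocator):
    """Place a file's dirty pages at sync time.

    Returns page index -> block number for pages that need space.
    All-zero pages get no block at all.
    """
    live = sorted(i for i, page in dirty_pages.items() if page != ZERO_PAGE)
    if not live:
        return {}
    start = allocator.allocate(len(live))      # one contiguous extent per file
    return {page: start + n for n, page in enumerate(live)}
```

Because the full set of dirty pages is known when `sync_file` runs, a single contiguous extent can be requested, and a file that was deleted before sync never reaches the allocator.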
  • 20. Layer Diff
      • Needed for docker commit/build operations to find the paths modified in a layer compared to its parent layer
      • Uses a custom diff driver, not NaiveDiffDriver
          − Except for pre-existing layers after remount
      • The plugin invokes getxattr calls to get the diff for a layer from LCFS
      • LCFS traverses the private icache of the layer and reports the inodes instantiated in the layer
      • Only for generating the diff from the parent layer
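A rough sketch of why this diff is cheap: only inodes instantiated in the layer live in its private icache, so the diff is a walk over that table rather than a comparison of two directory trees. `DiffLayer`, `instantiate` and `layer_diff` are hypothetical names, and mapping inode numbers directly to paths is a simplification.

```python
class DiffLayer:
    def __init__(self, parent=None):
        self.parent = parent
        self.icache = {}          # inode number -> path, only inodes created or copied up here

    def instantiate(self, ino, path):
        """Record that an inode was created or copied up in this layer."""
        self.icache[ino] = path


def layer_diff(layer):
    """Paths changed relative to the parent: the contents of the private icache."""
    return sorted(layer.icache.values())
```

The cost is proportional to the number of inodes touched in the layer, not to the size of the image underneath it.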
  • 21. Crash Consistency
      • Docker's database of images and containers needs to stay consistent even after an abnormal shutdown of the graphdriver
      • Considering a checkpointing scheme rather than a journaling scheme
          − Note that fsync is disabled
  • 22. Stats
      • Every operation in every layer is counted, and the total, maximum and minimum time for each type of operation is tracked
      • This information can be presented in tabular form on a per-layer basis on demand, periodically, or at the time a layer is unmounted
      • Stats for a container can be restarted before running an application, for proper tracing
      • Memory usage is tracked for each layer
      • The count of different file types in every layer is tracked
      • CPU profiling can be enabled with gperftools
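Per-operation accounting of this kind can be sketched as below. `OpStats`, `LayerStats`, `timed`, `reset` and `table` are hypothetical names; counting an operation as failed when it raises an exception is an assumption made for the sketch.

```python
import time
from collections import defaultdict


class OpStats:
    """Count, failures and total/max/min elapsed time for one operation type."""

    def __init__(self):
        self.count = 0
        self.failed = 0
        self.total = 0.0
        self.max = 0.0
        self.min = float("inf")

    def record(self, elapsed, ok=True):
        self.count += 1
        self.failed += 0 if ok else 1
        self.total += elapsed
        self.max = max(self.max, elapsed)
        self.min = min(self.min, elapsed)

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0


class LayerStats:
    def __init__(self):
        self.ops = defaultdict(OpStats)

    def timed(self, name, func, *args):
        """Run func, timing it and recording success or failure under name."""
        begin = time.perf_counter()
        ok = False
        try:
            result = func(*args)
            ok = True
            return result
        finally:
            self.ops[name].record(time.perf_counter() - begin, ok)

    def reset(self):
        """Restart the counters, for example before tracing an application."""
        self.ops.clear()

    def table(self):
        lines = ["Request      Total  Failed   Average       Max       Min"]
        for name, s in sorted(self.ops.items()):
            lines.append(f"{name:<10} {s.count:>7} {s.failed:>7} "
                         f"{s.average:>9.6f} {s.max:>9.6f} {s.min:>9.6f}")
        return "\n".join(lines)
```

Only a running total, maximum and minimum are kept per operation type, so the bookkeeping cost stays constant regardless of how many requests a layer serves.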
  • 23. Container stats
      Running a dd command in an ubuntu/bash container:
          dd if=/dev/zero of=file count=10000 bs=4096

      Stats for file system 0x1878680 with root 8130 index 7 at Thu Dec 8 09:26:30 2016
      Layer created at Thu Dec 8 09:25:11 2016
      Last acccessed at Thu Dec 8 09:26:14 2016

      Request:      Total   Failed   Average      Max          Min
      LOOKUP:       110     34       0s.000010u   0s.000054u   0s.000003u
      GETATTR:      36      0        0s.000005u   0s.000018u   0s.000003u
      READLINK:     22      0        0s.000006u   0s.000023u   0s.000004u
      OPEN:         43      0        0s.000005u   0s.000013u   0s.000003u
      READ:         191     0        0s.000068u   0s.000266u   0s.000004u
      FLUSH:        2       0        0s.000000u   0s.000000u   0s.000000u
      RELEASE:      35      0        0s.000039u   0s.000430u   0s.000003u
      OPENDIR:      1       0        0s.000007u   0s.000007u   0s.000007u
      RELEASEDIR:   1       0        0s.000007u   0s.000007u   0s.000007u
      CREATE:       1       0        0s.000011u   0s.000011u   0s.000011u
      WRITE_BUF:    10000   0        0s.000008u   0s.000120u   0s.000003u

      blocks allocated 1 freed 0 2 inodes 10000 pages 0 reads 0 writes (0 inodes written)
  • 24. Container Memory stats
      Running a dd command in an ubuntu/bash container:
          dd if=/dev/zero of=file count=10000 bs=4096

      Memory Stats for file system 0x1435a00 with root 8130 index 7 at Fri Dec 9 06:15:15 2016

      Type        Allocated   Freed
      DIRENT      21          0
      ICACHE      1           0
      INODE       2           0
      EXTENT      1           0
      BLOCK       1           0
      DATA        10000       0
      DPAGEHASH   14          13
      STATS       1           0

      Total memory in use 41213339 bytes
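A quick check of the numbers in this output, assuming each DATA allocation corresponds to one 4096-byte page (the dd block size; the output itself does not state the allocation size):

```python
PAGE_SIZE = 4096             # dd block size from the command above
data_pages = 10000           # "DATA Allocated 10000"
total_in_use = 41213339      # "Total memory in use", in bytes

data_bytes = data_pages * PAGE_SIZE          # 40,960,000 bytes of file data
other_bytes = total_in_use - data_bytes      # 253,339 bytes for everything else
other_share = other_bytes / total_in_use     # fraction not used by data pages
```

Under that assumption, the dirty data pages account for almost all of the memory in use, and everything else in the layer takes well under 1% of the total.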
  • 25. Time to Pull/Delete 30 popular images
      [Bar chart. Groups: Serial Pull, Parallel Pull, Serial Delete, Parallel Delete. Series: Devmapper, btrfs, Overlay, Overlay2, LCFS. Axis range: 0 to 800.]
  • 26. Time to Pull/Delete 30 popular images
      [Bar chart. Groups: Serial Pull, Parallel Pull, Serial Delete, Parallel Delete. Series: AUFS, LCFS. Axis range: 0 to 500.]
  • 27. Time to Pull individual images
      [Bar chart. Series: Overlay, Overlay2, LCFS. Axis range: 0 to 140. Images: php-zendserver, gcc, hectcastro/riak, jenkins, wordpress, kibana, rails, node, rabbitmq, fedora/apache, logstash, elasticsearch, golang, tomcat, sysdig/sysdig, django, cassandra, mongo, postgres, mysql, mariadb, maven, redis, php, httpd, haproxy, nginx, memcached, gliderlabs/logspout, java.]
  • 28. Time to Spawn fedora/apache Containers
      [Chart. Series: Devicemapper, btrfs, Overlay, Overlay2, LCFS. Horizontal axis: 20, 40, 60, 80, 100. Vertical axis range: 0 to 180.]
  • 29. Time to Spawn fedora/apache Containers
      [Chart. Series: AUFS, LCFS. Horizontal axis: 20, 40, 60, 80, 100. Vertical axis range: 0 to 60.]
  • 30. Time to Remove fedora/apache Containers
      [Chart. Series: Devmapper, btrfs, Overlay, Overlay2, LCFS. Horizontal axis: 20, 40, 60, 80, 100. Vertical axis range: 0 to 70.]
  • 31. Time to Remove fedora/apache Containers
      [Chart. Series: AUFS, LCFS. Horizontal axis: 20, 40, 60, 80, 100. Vertical axis range: 0 to 45.]
  • 32. Time to Build Docker sources
      [Bar chart. Group: Docker Build. Series: Devmapper, btrfs, Overlay, Overlay2, LCFS. Axis range: 0 to 1600.]
  • 33. Time to Build Docker sources
      [Bar chart. Group: Docker Build. Series: AUFS, LCFS. Axis range: 0 to 700.]
  • 34. IOPS with fiograph
      docker run portworx/fiograph --blocksize=1024K --filename=/root/1g.bin --ioengine=libaio --readwrite=read --size=1024M --name=test --gtod_reduce=1 --iodepth=1 --time_based --runtime=60
      [Bar chart. Groups: libaio, splice. Series: Devmapper, Overlay, Overlay2, LCFS. Axis range: 0 to 7000.]
  • 35. LCFS - A Docker V2 Graphdriver Plugin
      • Download and build LCFS, or install the RPM
          − git clone git@github.com:/portworx/lcfs, cd lcfs/lcfs, make
          − rpm -Uvh http://yum.portworx.com/repo/rpms/px-graph/lcfs-0.0.0-0.x86_64.rpm
      • Mount a device at /var/lib/docker and /lcfs
          − ./lcfs <device/file> /var/lib/docker /lcfs -f
      • Start docker with the vfs storage driver (1.13+)
          − dockerd -s vfs
      • Install the LCFS plugin
          − docker plugin install portworx/lcfs
      • Restart docker with the lcfs graphdriver
          − dockerd --experimental -s portworx/lcfs
  • 36. Pending tasks
      • Crash consistency
      • Metadata paging
      • Replace linear search algorithms
      • https://github.com/portworx/lcfs/issues
      • QA
  • 37. Roadmap
      • QoS at container level (CoS, IOPS, quotas, etc.)
      • Distributed graphdriver (images shared)
      • Seamless container migration in a cluster
          − Load balancing
      • Backup/restore of the graphdriver
  • 38. Q&A
      • More info
          − https://docs.docker.com/engine/userguide/storagedriver/imagesandcontainers/
          − https://github.com/portworx/lcfs
      • Thank You!