Fragmentation problem in vdisk environment
Dmitry Monakhov
2015-09-19
Dmitry Monakhov Fragmentation problem in vdisk environment 2015-09-19 1 / 29
Outline
1 Introduction
2 FS fragmentation
3 An Era of Thin Provisioning Environments
4 Future work
Basic terminology
A filesystem divides its space into blocks (usually 4 KiB)
Files consist of blocks
A file is fragmented if its blocks are not contiguous
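To make the definition concrete, here is a minimal sketch that counts the fragments of a file from its list of physical block numbers. The function name and the sample block lists are illustrative, not taken from the talk.

```python
def count_fragments(blocks):
    """Count contiguous runs in a file's list of physical block numbers."""
    if not blocks:
        return 0
    fragments = 1
    for prev, cur in zip(blocks, blocks[1:]):
        # A new fragment starts whenever the next block is not adjacent.
        if cur != prev + 1:
            fragments += 1
    return fragments


contiguous = [100, 101, 102, 103]        # one run: not fragmented
scattered = [100, 101, 500, 501, 9000]   # three runs: fragmented
print(count_fragments(contiguous), count_fragments(scattered))
```

A file with a single run is unfragmented; anything above one means at least one seek or extra mapping per additional run.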
FS aging problem
Zillions of block-alloc/block-free iterations result in fs fragmentation
Most filesystems have effective and reliable techniques which prevent
fs aging
The block allocator tries to spread data across the whole disk
The block allocator tries to pack small files together
The block allocator delays allocation until close(2)/fsync(2)
Online/offline defragmentation tools [still required]
When defragmentation is required
There are situations when block allocator tricks are not sufficient
Filesystem is almost full (90%)
Weird falloc/unlink/fsync scenario
Special read pattern (boot speedup)
Fragmentation: More formal terminology
IntrA-file-Fragmentation (IAF): fragmentation of a single file.
IntEr-file-Fragmentation (IEF): fragmentation of a group of files [1]
[1] Terminology from DFS paper
Existing tools
EXT4
Ioctl EXT4_IOC_MOVE_EXT (atomic)
Swap blocks between donor and target file
Util: e4defrag(8): defragments large files (*IAF*)
XFS
Ioctl XFS_IOC_SWAPEXT (non-atomic)
Swap blocks between donor and target file
Util: xfs_fsr(8): defragments large files (*IAF*)
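As a rough illustration of the donor/target swap on ext4, the sketch below builds the request for the ioctl that the kernel headers name EXT4_IOC_MOVE_EXT. The struct layout follows `struct move_extent` as used by e4defrag. The helper names `_iowr` and `move_extent` are made up for this example, and the request-number encoding assumes the common x86/ARM ioctl layout. It is a sketch, not a replacement for e4defrag(8).

```python
import fcntl
import struct

# struct move_extent: __u32 reserved, __u32 donor_fd, then four __u64 fields
# (orig_start, donor_start, len, moved_len), all counted in filesystem blocks.
MOVE_EXTENT = struct.Struct("=IIQQQQ")


def _iowr(type_char, nr, size):
    """Rebuild the Linux _IOWR() request number (x86/ARM encoding assumed)."""
    return (3 << 30) | (size << 16) | (ord(type_char) << 8) | nr


EXT4_IOC_MOVE_EXT = _iowr("f", 15, MOVE_EXTENT.size)


def move_extent(orig_fd, donor_fd, orig_start, donor_start, length):
    """Ask ext4 to swap `length` blocks between the two files.

    Returns the number of blocks the kernel reports as moved.
    """
    buf = bytearray(
        MOVE_EXTENT.pack(0, donor_fd, orig_start, donor_start, length, 0)
    )
    fcntl.ioctl(orig_fd, EXT4_IOC_MOVE_EXT, buf)  # kernel fills in moved_len
    return MOVE_EXTENT.unpack(buf)[5]
```

The caller is expected to have preallocated the donor file (for example with fallocate) so that its blocks are laid out better than the original's before the swap.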
Basic disk layout
Virtual Disk: Things got complicated*
New indirection layer
The thin-provisioning driver adds a second space-management layer; it divides
its space into allocation blocks, aka TPABs or buckets.
Bucket size != FS block size
A TPAB is larger than an fs block, but smaller than an fs group
1M-4M: Ploop, LVM-linear, QCOW2, Ceph (RBD)
64k-256k: dm-thin, dm-snap
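The size mismatch between the two layers can be sketched with a little arithmetic. The snippet assumes a 1 MiB TPAB and 4 KiB fs blocks, both within the ranges quoted above. The function names are illustrative only.

```python
FS_BLOCK_SIZE = 4096                  # bytes, typical fs block
TPAB_SIZE = 1 << 20                   # bytes, assumed 1 MiB allocation block
BLOCKS_PER_TPAB = TPAB_SIZE // FS_BLOCK_SIZE


def tpab_of(block):
    """Index of the TPAB that backs a given fs block number."""
    return block // BLOCKS_PER_TPAB


def tpabs_touched(blocks):
    """Set of TPABs the thin-provisioning layer must keep allocated."""
    return {tpab_of(b) for b in blocks}


# Three live 4 KiB blocks, far apart: 12 KiB of data pins 3 MiB of backing store.
live_blocks = [0, 300, 1000]
pinned = tpabs_touched(live_blocks)
print(len(live_blocks) * FS_BLOCK_SIZE, len(pinned) * TPAB_SIZE)
```

A TPAB can only be returned to the host once every fs block inside it is free, which is why scattered small allocations are so expensive here.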
Virtual disk mapping
Customer's feedback
I have a container with a mail server inside which uses 10 GB of data.
Your virtual disk uses 40 GB of my super-fast SSD.
WHY?
Virtual disk fragmentation example
root@dmlp:~# e2freefrag -c 4096 /dev/dm-1
Device: /dev/dm-1
Blocksize: 4096 bytes
Total blocks: 34126848
Free blocks: 12293324 (36.0%)
Chunksize: 4194304 bytes (1024 blocks)
Total chunks: 33328
Free chunks: 8379 (25.1%)
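The numbers in this output can be read as follows: 36.0% of the blocks are free, but only 25.1% of the 4 MiB chunks are entirely free. The short calculation below, using only the figures printed above, estimates how much free space is stranded inside partly used chunks and therefore cannot be released at chunk granularity.

```python
block_size = 4096
free_blocks = 12293324
blocks_per_chunk = 1024
free_chunks = 8379

# Free blocks that sit in completely free 4 MiB chunks.
free_in_whole_chunks = free_chunks * blocks_per_chunk

# Free blocks scattered across chunks that still hold some data.
stranded_blocks = free_blocks - free_in_whole_chunks
stranded_gib = stranded_blocks * block_size / 2**30
stranded_share = stranded_blocks / free_blocks

print(f"{stranded_gib:.1f} GiB stranded, {stranded_share:.1%} of all free space")
```

Roughly 14.2 GiB, about 30% of the free space, sits in chunks that the thin-provisioning layer still has to keep allocated.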
ThinProvision fragmentation problem
Visible effect
Inefficient free-space usage (up to 0.4%)
Bad IO performance
Why?
TRIM/Discard is useless
Existing FS defragmentation tools/techniques are useless
Who is affected?
Worst use-case
Many small files
A lot of create(2)/unlink(2)
Unpredictable lifetime
Massive write(2); sync(2)/fsync(2)
Bad pattern examples
Mail server
News server
Photo server
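A toy simulation shows why this pattern hurts. It assumes one-block files written back to back, 256 fs blocks per TPAB, and a random 90% of the files deleted afterwards. All of these numbers are assumptions chosen for illustration, not measurements from the talk.

```python
import random

BLOCKS_PER_TPAB = 256    # assumed: 1 MiB TPAB, 4 KiB fs blocks
TOTAL_BLOCKS = 10240     # 40 TPABs filled with one-block files

rng = random.Random(0)   # fixed seed so the run is repeatable
# Files with unpredictable lifetime: a random 10% survive the unlink(2) storm.
survivors = rng.sample(range(TOTAL_BLOCKS), TOTAL_BLOCKS // 10)

# TPABs that still contain at least one live block cannot be released.
used_tpabs = len({block // BLOCKS_PER_TPAB for block in survivors})
# TPABs that would suffice if the survivors were packed tightly.
needed_tpabs = -(-len(survivors) // BLOCKS_PER_TPAB)  # ceiling division

print(used_tpabs, needed_tpabs)
```

With random survivors, practically every TPAB keeps at least one live block, so the image stays about ten times larger than the data it holds.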
Image bloating example
New TP defragmentation API wanted
New TP-aware block allocator for FS
New TP-aware defragmentation tool
TP-aware defragmentation tool principles
Take the TP layout into account
Relocate a group of files together into one TPAB
The only questions left
What to relocate?
Where to relocate?
TP-aware defragmentation overview
1 Sequential scan of the block bitmap tables. Collect used blocks
(build spextent tree)
2 Scan the filesystem hierarchy and collect extent ownership statistics.
3 Rescan the filesystem tree and prepare a list of candidates for IEF
defragmentation.
Fix IntrA-file-Fragmentation (IAF) issues if discovered
4 Process the IEF list and perform the actual defragmentation
Pass1
Sequential scan of the block bitmap tables.
Build free-space tree.
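A minimal sketch of this pass, assuming the ext2/ext4 convention that bit i of the block bitmap (least significant bit first within each byte) is set when block i is in use. It yields free extents as a sorted sequence rather than building a tree, which is enough to show the idea. The function name is illustrative.

```python
def free_extents(bitmap, nblocks):
    """Yield (start, length) runs of free blocks from a block bitmap."""
    start = None
    for i in range(nblocks):
        used = (bitmap[i // 8] >> (i % 8)) & 1
        if not used:
            if start is None:
                start = i            # a free run begins here
        elif start is not None:
            yield (start, i - start)  # a used block closes the run
            start = None
    if start is not None:
        yield (start, nblocks - start)


# Blocks 0-3 and 12-15 in use, blocks 4-11 free.
sample = bytes([0b00001111, 0b11110000])
print(list(free_extents(sample, 16)))
```

Because the bitmap is read sequentially, the extents come out already ordered by block number and can be inserted into whatever index the later passes need.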
Pass2
Scan the filesystem hierarchy and collect extent ownership statistics.
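One way to picture the statistics collected here: for every TPAB, record which files own how many blocks inside it. The sketch assumes the extents of each file are already known as (start, length) pairs in fs blocks and that a TPAB spans 256 blocks. The input format and names are hypothetical, not the data structures of e4defrag2.

```python
from collections import defaultdict

BLOCKS_PER_TPAB = 256  # assumed: 1 MiB TPAB, 4 KiB fs blocks


def ownership_stats(file_extents):
    """Map TPAB index -> {path: number of blocks that path owns in the TPAB}."""
    stats = defaultdict(lambda: defaultdict(int))
    for path, extents in file_extents.items():
        for start, length in extents:
            block, end = start, start + length
            while block < end:
                tpab = block // BLOCKS_PER_TPAB
                tpab_end = (tpab + 1) * BLOCKS_PER_TPAB
                # Count only the part of the extent that falls in this TPAB.
                n = min(end, tpab_end) - block
                stats[tpab][path] += n
                block += n
    return stats


files = {"a": [(250, 10)], "b": [(0, 4)]}
stats = ownership_stats(files)
print({tpab: dict(owners) for tpab, owners in stats.items()})
```

A TPAB whose owners add up to far fewer than 256 blocks is partly populated, which is exactly what the next pass looks for.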
Pass3
Rescan the filesystem tree and prepare a list of candidates for IEF
defragmentation.
Which candidates are good?
Files which belong to a partly populated cluster
Read-only files (old mtime or executable files)
Small files
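The three criteria can be turned into a simple score. How the real tool weighs them is not stated in the slides, so the thresholds and the plain sum below are assumptions made purely for illustration.

```python
import stat

SMALL_FILE_LIMIT = 64 * 1024   # assumed size threshold, bytes
COLD_AGE = 30 * 24 * 3600      # assumed "old mtime" threshold, seconds
PARTLY_POPULATED = 0.5         # assumed fill ratio below which a cluster counts


def candidate_score(mode, size, mtime, cluster_fill, now):
    """Return how many of the three 'good candidate' criteria a file meets."""
    partly = 0 < cluster_fill < PARTLY_POPULATED
    executable = bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
    looks_readonly = executable or (now - mtime) > COLD_AGE
    small = size <= SMALL_FILE_LIMIT
    return int(partly) + int(looks_readonly) + int(small)


now = 1_000_000_000
old_script = candidate_score(0o100755, 4096, now - 90 * 86400, 0.1, now)
hot_big_file = candidate_score(0o100644, 10 * 2**20, now, 0.9, now)
print(old_script, hot_big_file)
```

Files that are unlikely to change soon are preferred because relocating them once keeps the target TPAB densely packed for a long time.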
Pass3 image
Pass4
Process IEF list and perform actual defragmentation
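The placement policy of the real tool is not described in the slides. As a stand-in, the sketch below plans relocations with a first-fit-decreasing heuristic: candidates are packed into fully free TPABs, largest first. It only produces a plan and moves no data. The names and the 256-block TPAB are assumptions.

```python
def plan_relocation(candidates, free_tpabs, blocks_per_tpab=256):
    """Assign candidate files (path -> size in blocks) to free TPABs.

    Returns {path: tpab}; files that fit nowhere are left out of the plan.
    """
    room = {tpab: blocks_per_tpab for tpab in free_tpabs}
    plan = {}
    # Largest files first, so small ones can fill the remaining gaps.
    for path, nblocks in sorted(candidates.items(), key=lambda kv: -kv[1]):
        for tpab in free_tpabs:
            if room[tpab] >= nblocks:
                room[tpab] -= nblocks
                plan[path] = tpab
                break
    return plan


candidates = {"a": 200, "b": 100, "c": 56, "d": 150}
plan = plan_relocation(candidates, free_tpabs=[7, 9])
print(plan)
```

Once the planned files have been moved, the TPABs they came from may become entirely free and can be discarded by the thin-provisioning layer.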
Pass4 before
Pass4, after
Integration
OVZ case
Call pcompact(8) nightly from cron
pcompact invokes e4defrag2 and ploop compact for each ploop
Customer's feedback
OK, the ploop image size is fine now, but...
Sometimes pcompact runs all the time.
Source
GITHUB https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c
OVZ.GIT TODO add pcompact to git.openvz.org
[Future work] Standard bitmap scan API required
Currently, used-block info is obtained via e2fsprogs/xfs-progs
XFS: FS-wide analog of FIEMAP
XFS_IOC_FIEMAPFS
Implement ioctl for EXT4
Move userspace to this new ioctl
Massive testing and fine tuning.
[Future work 2] Smart block allocator
Dave Chinner suggests a smart block allocator which encapsulates all
smart-disk internals
Hide SMR internals
Hide TP internals
Garbage collection
Smart block allocator API proposal
Place my data somewhere, and tell me the location
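The shape of such an API can be illustrated with a toy in-memory allocator: the caller hands over data and gets the location back, instead of choosing the location itself. The class and method names are hypothetical and do not correspond to any existing kernel interface.

```python
class ToySmartAllocator:
    """Hypothetical in-memory stand-in for a device that picks locations itself."""

    def __init__(self, nblocks):
        self._free = list(range(nblocks))
        self._data = {}

    def place(self, payload):
        """Store payload wherever the device prefers and report where it landed."""
        if not self._free:
            raise OSError("no space left on device")
        location = self._free.pop(0)
        self._data[location] = payload
        return location

    def read(self, location):
        """Return the payload previously placed at this location."""
        return self._data[location]

    def release(self, location):
        """Free a location so the device can reuse or garbage-collect it."""
        del self._data[location]
        self._free.append(location)


dev = ToySmartAllocator(4)
loc = dev.place(b"mail")
print(loc, dev.read(loc))
```

Because the filesystem never dictates the address, the device is free to apply its own SMR or thin-provisioning layout rules behind this interface.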
