FSCK SX
      An approach to FSCK Performance Enhancement
            Gaurav Naigaonkar, Sanjyot Tipnis, Ajay Mandvekar, Moksh Matta
                          {gnaigaonkar, sanj312}@gmail.com
                      {ajay30_sam, mokshmatta1004}@hotmail.com
                       Pune Institute of Computer Technology, Pune


Abstract

        File System Check Program (fsck) is an interactive file system check
and repair program. Fsck uses the redundant structural information in the UNIX
file system to perform several consistency checks. Unfortunately, disk capacity
is growing faster than disk bandwidth, seek times are hardly budging, and the
overall chance of an I/O error occurring somewhere on the disk is increasing.
The result: the traditional file system check and repair cycle will be not only
longer, but also more frequent, with disastrous consequences for data
availability. Data reliability will also decline with the frequency of
corruption.

        We have implemented techniques for reducing the average fsck time on
ext3 file systems. First, we improve performance by parallelizing the two major
operations of fsck: fetching metadata and checking it. This multithreaded
operation, along with intelligent issuing of IO requests, helps to greatly
reduce the overall seek time. We have also implemented 'Metaclustering',
wherein we store the indirect blocks in clusters on a per-group basis instead
of spreading them out along with the data blocks. This makes fsck even faster,
since it can now read and verify all indirect blocks without much seeking.

Keywords: Fsck, Parallelism, Metadata clustering, Readahead

1. Introduction

        File system repair is usually an afterthought for file system
designers. One reason is that repair is difficult and annoying to reason about.
It's neither possible nor worthwhile to fix all error modes, so we must focus
our efforts on the ones that commonly occur, yet we do not know what they are
until we encounter them in the wild. In practice, most file system repair code
is written in response to an observed corruption mode. File system repair is
annoying because, by definition, something went wrong and we must think outside
the state space of our beautifully designed system. In the end, designing a
file system is more fun than designing a file system checker.

        For many years, we could brush off the importance of making file system
repair fast and reliable with the following chain of reasoning: file system
corruption is a rare event, and when it does occur, repairing it takes only a
few minutes or maybe a few hours of downtime, and if repair is too difficult or
time-consuming, "that's what backups are for." Unfortunately, if this reasoning
was ever valid, it is being eroded by some inconvenient truths about disk
hardware trends.

                           2006    2009    2013    Change
        Capacity (GB)      500     2000    8000    16x
        Bandwidth (Mb/s)   1000    2000    5000    5x
        Seek Time (ms)     8       7.2     6.5     1.2x

       Table 1: Projected disk hardware trends
        As Table 1 shows, Seagate projects that during the same time that disk
capacity increases by 16 times, disk bandwidth will increase by only 5 times,
and seek times will remain nearly unchanged. This is good news for many common
workloads: we can store more data and read and write more of it at once. But it
is terrible news for any workload that is (a) proportional to the size of the
disk, (b) throughput-intensive, and (c) seek-intensive.

        One workload that fits this profile is file system check and repair. It
has been calculated that file system check and repair time will increase by
approximately a factor of 10 between 2006 and 2013 with today's file system
formats.

        At the same time that capacity is increasing, the per-bit error rate is
improving. However, for an overall improvement in the error rate for operations
that read data proportional to the file system size (such as fsck), the per-bit
error rate must improve as fast as capacity grows, which seems unlikely. We
conclude that the frequency of file system corruption and the necessary check
and repair or restore is more likely to increase than decrease. This
combination of increasing fsck time and increasing fsck frequency is what we
call the fsck time crunch.

2. The fsck program

        Cutting down crash recovery time for an ext3 file system depends on
understanding how the file system checker program, fsck, works. After Linux has
finished booting the kernel, the root file system is mounted read-only and the
kernel executes the init program. As part of normal system initialization, fsck
is run on the root file system before it is remounted read-write, and on other
file systems before they are mounted. Repair of the file system is necessary
before it can be safely written. When fsck runs, it checks to see if the ext3
file system was cleanly unmounted by reading the state field in the file system
superblock. If the state is set as VALID, the file system is already consistent
and does not need recovery; fsck exits without further ado. If the state is
INVALID, fsck does a full check of the file system integrity, repairing any
inconsistencies it finds. In order to check the correctness of allocation
bitmaps, file nlinks, directory entries, etc., fsck reads every inode in the
system, every indirect block referenced by an inode, and every directory entry.
Using this information, it builds up a new set of inode and block allocation
bitmaps, calculates the correct number of links of every inode, and removes
directory entries to unreferenced inodes. It does many other things as well,
such as sanity checking inode fields, but these three activities fundamentally
require reading every inode in the file system. Otherwise, there is no way to
find out whether, for example, a particular block is referenced by a file but
is marked as unallocated on the block allocation bitmap. In summary, there are
no back pointers from a data block to the indirect block that points to it, or
from a file to the directories that point to it, so the only way to reconstruct
reference counts is to start at the top level and build a complete picture of
the file system metadata.
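
        The VALID/INVALID decision above turns on a single flag in the
superblock. As a rough standalone illustration (not the e2fsck code itself,
which goes through the ext2fs library; the device path is a placeholder), the
following sketch reads the ext2/ext3 superblock, which lives 1024 bytes into
the partition, and tests its s_state field:

    /* Illustrative only: decide whether a full check is needed by
     * reading the ext2/ext3 superblock state flag. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define EXT2_SUPER_OFFSET 1024   /* superblock starts at byte 1024 */
    #define EXT2_SUPER_MAGIC  0xEF53
    #define EXT2_VALID_FS     0x0001 /* set when cleanly unmounted */

    int main(int argc, char **argv)
    {
        unsigned char sb[1024];
        int fd = open(argc > 1 ? argv[1] : "/dev/sda1", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        if (pread(fd, sb, sizeof sb, EXT2_SUPER_OFFSET) != sizeof sb) {
            perror("pread"); return 1;
        }
        /* s_magic and s_state are little-endian 16-bit fields at
         * offsets 56 and 58 within the superblock. */
        uint16_t magic = sb[56] | (sb[57] << 8);
        uint16_t state = sb[58] | (sb[59] << 8);
        if (magic != EXT2_SUPER_MAGIC) {
            fprintf(stderr, "not an ext2/ext3 file system\n");
            return 1;
        }
        puts(state & EXT2_VALID_FS ? "VALID: skip full check"
                                   : "INVALID: full check needed");
        close(fd);
        return 0;
    }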
        Unsurprisingly, it takes fsck quite some time to rebuild the entirety
of the file system metadata: approximately O(total file system size + data
stored). The average laptop takes several minutes to fsck an ext2 file system;
large file servers can sometimes take hours or, on occasion, days!

        Straightforward tactical performance optimizations, such as requesting
reads of needed blocks in sequential order and issuing readahead requests, can
only improve the situation so much, given that the whole operation will still
take time proportional to the entire file system. What we want is file system
recovery time that is O(writes in progress), as is the case for journal replay
in journaling file systems.

3. Motivation

        The fundamental limiting factors in the performance of fsck are the
amount of data read, the number of separate I/Os, how scattered the data is on
disk, the number of dependent reads, and the CPU time required to check and
repair the data read. The amount of memory available is a factor as well,
though most fsck programs operate on an all-or-nothing basis: either there is
enough memory to fit all the needed metadata for a particular checking pass or
the checker simply aborts.

        The time required to read the file system metadata is partially
constrained by the bandwidth of the disk. Depending on the file system, some
kinds of file system data, such as blocks of statically allocated inodes or
block group summaries, are located in contiguous chunks at known locations.
Reading this data off disk is relatively swift.

        Other kinds of file system data are dynamically allocated, such as
directory entries, indirect blocks, and extents, and hence are scattered all
over the disk. Many modern file systems allocate nearly all their metadata
dynamically. The location of much of this kind of metadata is not known until
the block of metadata pointing to it is read off the disk, introducing many
levels of dependent reads. This portion of the file system check is usually the
most time consuming, as we must issue a set of small scattered reads, wait for
them to complete, read the address of the next block, then issue another set of
reads.

        Finally, we need CPU time and sufficient memory to actually compute the
consistency checks on the data we have read off disk. This takes a relatively
small amount of time compared to the time spent doing what are effectively
random 4 KB or similar-sized reads, although more complex file systems may burn
more CPU time in computing checksums or similar tasks. In summary, the ways to
reduce fsck time, in rough order of effectiveness, are to reduce seeks, reduce
dependent reads, reduce the amount of metadata that needs to be read (either by
reducing the overall quantity or the fraction that must be read), and to reduce
the complexity of the consistency checks themselves.

4. Our Approach

        In order to discover and correct filesystem errors, fsck must read all
the metadata in the entire file system. Hence, the basic idea of our project is
to introduce parallelism into the operation of fsck by prefetching these
metadata blocks (which include inodes, bitmaps, directory entries, indirect
blocks, block group summaries, etc.) and simultaneously performing consistency
checks on this prefetched data.

        Originally, fsck consists of a single thread of operation. To enhance
fsck performance, we have added an extra thread to read ahead the indirect
blocks. Thus, the project involves two threads, namely a 'main thread' and a
'prefetch thread'. The main thread is responsible for the actual data checking
while the prefetch thread fetches the metadata (indirect blocks) for the main
thread. This modification ensures that when the main thread begins its checking
operation, the metadata it requires has already been brought into the system
cache by the prefetch thread. Hence, there is a reduction in the overall time
taken by fsck to complete its operations.
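
        A minimal sketch of this two-thread split is shown below, assuming the
prefetch thread warms the page cache with posix_fadvise() while the main thread
reads and checks the same blocks. The device path and block list are
placeholders, and the real implementation lives inside e2fsck's pass logic
rather than a standalone program:

    /* Sketch of the main/prefetch split (build with: cc sketch.c -lpthread).
     * The prefetch thread asks the kernel to pull blocks into the page
     * cache so that the main thread's later pread() calls hit the cache. */
    #include <pthread.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLKSZ 4096

    static int dev_fd;
    /* Hypothetical example list of indirect block numbers. */
    static const off_t indirect_blocks[] = { 1024, 1025, 1026, 9000, 9001 };
    static const int nblocks = sizeof indirect_blocks / sizeof *indirect_blocks;

    static void *prefetch_thread(void *arg)
    {
        (void)arg;
        for (int i = 0; i < nblocks; i++)
            posix_fadvise(dev_fd, indirect_blocks[i] * BLKSZ, BLKSZ,
                          POSIX_FADV_WILLNEED);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        char buf[BLKSZ];
        pthread_t pf;

        dev_fd = open(argc > 1 ? argv[1] : "/dev/sda1", O_RDONLY);
        if (dev_fd < 0) { perror("open"); return 1; }

        pthread_create(&pf, NULL, prefetch_thread, NULL);

        for (int i = 0; i < nblocks; i++) {
            if (pread(dev_fd, buf, BLKSZ, indirect_blocks[i] * BLKSZ) != BLKSZ)
                perror("pread");
            /* ... consistency checks on buf would go here ... */
        }

        pthread_join(pf, NULL);
        close(dev_fd);
        return 0;
    }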
        While actually fetching data from the disk, the prefetch thread has to
go to the disk many times, and each time the amount of data brought into the
cache will be minimal. As a solution to this, we have designed a strategy to
reduce the number of disk seeks for indirect blocks and also to increase the
amount of data brought in each time we go to the disk. This strategy involves
queuing the block numbers to be brought in until we reach the end of a block
group; once the end is reached, we issue these queued IOs to fetch the blocks
from disk. Also, by merging the IO requests, we ensure that during each disk
seek the maximum amount of data is prefetched into the cache, instead of
issuing single IOs.

        Another aspect of the project is metadata clustering. Metaclustering
refers to storing indirect blocks in clusters on a per-group basis instead of
spreading them out along with the data blocks. This makes fsck faster since it
can now read and verify all indirect blocks without many seeks.

        Fsck involves five passes. Pass 1 is responsible for checking inodes,
blocks and sizes, and pass 2 for checking directory structure. Implementing the
above-mentioned features helps us achieve a reduction in the times for these
two passes, which take up the maximum time as compared to the other passes.
Hence, by our modifications and additions to the original fsck utility we can
ensure an improvement in the overall performance of fsck.

5. Implementation

   5.1 Parallelization Operation

       Figure 1: Multithreaded Fsck

        As mentioned earlier, when the main fsck thread begins its operation,
it needs to fetch the metadata from the disk. This data is then checked for
consistency. In other words, while the metadata is being fetched no other
checking is performed and the CPU remains idle. This is a major bottleneck
which leads to fsck taking an enormous amount of time to check and repair the
file system. As a solution to this, we have implemented a multi-threaded model
in which we create a new thread to perform the fetching of metadata. This
thread, which we call the 'prefetch' thread, performs the task of prefetching
metadata from disk and making it available to the main fsck thread for
performing the usual consistency checks. In this way, we can ensure maximum CPU
utilization by coordinating the operation of both the main and the prefetch
threads. Since the prefetch thread reads in all the metadata that the main
thread requires, the main thread is absolved of the fetching work. As a result,
the checking and fetching of data can take place simultaneously, thus ensuring
performance benefits with regard to the time taken for the overall working of
fsck.

   5.2 Working of Pre-fetch thread

        As the name suggests, the prefetch thread has been introduced to
prefetch, or read ahead, the metadata for the main thread. We have added two
new queues, namely a 'select queue' and an 'IO queue', which form an integral
part of the prefetch procedure.

        The working of the prefetch thread and the queues can be better
understood from Figure 2 and can be summarised as follows:

1. Initially the inode table location on disk, i.e. the block number holding
   the inode table, is read into the IO queue.

2. This inode table block is then actually fetched from disk into the buffer
   cache.
       Figure 2: Pre-fetch thread working

3. From this table, individual inodes are picked up to perform various
   consistency checks.

4. For each inode in the table, the prefetch thread fetches the indirect block
   numbers associated with it into the select queue. Thus, we see that the
   select queue holds the indirect block numbers of every inode in the current
   inode table.

5. Once the end of the block group is reached, the select queue is sorted by
   block number and merging is performed to club together contiguous block
   numbers (see the sketch after the list of benefits below). Then all those
   block numbers that lie within the current block group are transferred into
   the IO queue. Thus, the IO queue holds all those block numbers that are to
   be currently fetched from disk.

6. Finally, the indirect blocks indicated by the block numbers present in the
   IO queue are fetched from disk into the buffer cache. Thus, the required
   metadata blocks become available to the main thread.

The above implementation provides the following benefits:

1. The main fsck thread performs only checking of metadata and does not have
   to go to the disk to fetch any blocks, as all the blocks required by the
   main thread have been prefetched into the system cache by the prefetch
   thread.

2. The select queue that holds the indirect block numbers is sorted. This
   helps us achieve a nearly sequential sweep of the read-write head over the
   disk.

3. By merging the requests in the select queue, we reduce the number of times
   the prefetch thread needs to go to the disk to fetch blocks. Thus we ensure
   minimal in-ordered seeks and also minimal overall fetches from disk.

4. Only those block numbers that lie within the current block group are held
   by the IO queue. These blocks are then fetched from disk. Thus, we limit
   our fetching to the current block group while delaying fetching those
   indirect blocks that lie in other block groups. This further helps in
   reducing random movement of the read-write head over the disk.
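
        The sort-and-merge step (step 5) and benefits 2 and 3 amount to
coalescing sorted block numbers into contiguous extents, so that each disk
request fetches several blocks in one seek. A self-contained sketch of that
step, with illustrative types and names rather than our actual queue code,
might look like this:

    /* Sort queued block numbers, then coalesce contiguous runs into
     * (start, count) extents; duplicates are skipped. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned long blk_t;
    struct extent { blk_t start; unsigned count; };

    static int cmp_blk(const void *a, const void *b)
    {
        blk_t x = *(const blk_t *)a, y = *(const blk_t *)b;
        return (x > y) - (x < y);
    }

    /* Merge blks[0..n) into out[]; returns the number of extents. */
    static int merge_blocks(blk_t *blks, int n, struct extent *out)
    {
        int m = 0;
        qsort(blks, n, sizeof *blks, cmp_blk);
        for (int i = 0; i < n; i++) {
            if (m && blks[i] < out[m-1].start + out[m-1].count)
                continue;                 /* duplicate block number  */
            if (m && blks[i] == out[m-1].start + out[m-1].count)
                out[m-1].count++;         /* contiguous: grow extent */
            else {
                out[m].start = blks[i];   /* gap: start a new extent */
                out[m].count = 1;
                m++;
            }
        }
        return m;
    }

    int main(void)
    {
        blk_t q[] = { 9001, 1026, 1024, 9000, 1025 };
        struct extent ext[5];
        int m = merge_blocks(q, 5, ext);
        for (int i = 0; i < m; i++)       /* prints 1024+3 and 9000+2 */
            printf("extent: start=%lu count=%u\n",
                   ext[i].start, ext[i].count);
        return 0;
    }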
   5.3 Metadata Clustering

        Every block group has at its end a semi-reserved region called the
'metacluster'. This region is mostly used for allocating indirect blocks. Under
normal circumstances, the metacluster is used only for allocating indirect
blocks, which are allocated in decreasing order of block numbers. The
non-metacluster region is used for data block allocations, which are allocated
in increasing order of block numbers. However, when the metacluster (MC) runs
out of space, indirect blocks can be allocated in the non-MC region along with
the data blocks, in the forward direction. Similarly, when the non-MC region
runs out of space, new data blocks are allocated in the MC, but in the forward
direction.

       Figure 3: Metaclustering

The steps involved in metadata clustering are:

1. Read the inode table into the buffer cache.
2. Scan through each inode present in the table.
3. Find the number of indirect blocks indicated by the inodes.
4. Find the amount of contiguous free space (metacluster region) required to
   cluster or group together these indirect blocks.
5. Transfer the indirect blocks into the metacluster region.
6. Perform the required updates of metadata to reflect the above changes to
   the file system.

Such clustering helps to club together the indirect blocks and hence reduces
the seeks needed to fetch these blocks, which otherwise are spread out across
the filesystem.
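
        This paper does not fix the size of the metacluster; the sketch below
assumes a fixed per-group reservation and shows only the bookkeeping for the
two opposing allocation directions described above. All sizes and names are
hypothetical, and free-bitmap checks and collision handling between the two
cursors are omitted:

    /* Hypothetical per-group metacluster bookkeeping: data blocks grow
     * forward from the group start, indirect blocks backward from the
     * group end (the metacluster); either region spills into the other
     * in the forward direction when exhausted. */
    #include <stdio.h>

    #define BLOCKS_PER_GROUP 32768UL  /* ext3 default with 4 KB blocks */
    #define MC_SIZE          512UL    /* assumed metacluster reservation */

    struct group_cursor {
        unsigned long base;       /* first block of this group           */
        unsigned long next_data;  /* next data block, moves forward      */
        unsigned long next_ind;   /* next indirect block, moves backward */
    };

    static unsigned long alloc_data(struct group_cursor *g)
    {
        unsigned long mc_start = g->base + BLOCKS_PER_GROUP - MC_SIZE;
        if (g->next_data < mc_start)
            return g->next_data++;    /* normal case: non-MC, forward */
        if (g->next_data < g->base + BLOCKS_PER_GROUP)
            return g->next_data++;    /* non-MC full: spill into MC   */
        return 0;                     /* group exhausted              */
    }

    static unsigned long alloc_indirect(struct group_cursor *g)
    {
        unsigned long mc_start = g->base + BLOCKS_PER_GROUP - MC_SIZE;
        if (g->next_ind >= mc_start)
            return g->next_ind--;     /* normal case: MC, backward    */
        return alloc_data(g);         /* MC full: spill forward       */
    }

    int main(void)
    {
        struct group_cursor g = {
            .base      = BLOCKS_PER_GROUP,          /* block group 1  */
            .next_data = BLOCKS_PER_GROUP,
            .next_ind  = 2 * BLOCKS_PER_GROUP - 1,
        };
        printf("data     -> %lu\n", alloc_data(&g));     /* 32768 */
        printf("indirect -> %lu\n", alloc_indirect(&g)); /* 65535 */
        printf("indirect -> %lu\n", alloc_indirect(&g)); /* 65534 */
        return 0;
    }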

6. Performance Evaluation

   6.1 Test Environment

        - Processor: Intel Core 2 Duo (2.6 GHz)
        - Memory: 512 MB
        - Operating System: Kubuntu Gutsy Gibbon 7.10
        - Kernel: 2.6.23
        - Base File System Checker Code: e2fsprogs-1.40.4

        Number of disks used        2+1
        Type of disks               IDE
        Size of each disk           40 GB
        Partition size              80 GB (RAID 0)
        Avg. Seek Time              9 ms
        Avg. Rotational Latency     6 ms

       Table 2: Experiment Disk Characteristics (1 GB = 10^9 bytes)

   6.2 Time vs Percentage of file system under use

        We measured the time taken to run fsck on our test machine with a
gradual increase in the percentage of the file system used. We observed that by
using the readahead concept to fetch the data, we get about a 30-35% decrease
in the time taken to complete an fsck run.

        % filled    Original    Modified
        15          99.62       66.26
        25          198.52      132.69
        35          303.1       215.65
        45          331.88      231.46
        55          412.32      290.87
        65          475.72      365.14
        75          526.39      397.67

       Table 3: Time taken to run fsck (seconds)
       Graph 1: Time vs. FS usage (%)

   6.3 Finding sequential order break with respect to the file system usage

        We also measured the number of times fsck needs to fetch blocks which
are in-ordered, i.e. which break the sequential movement of the read-write
head, and found that in the original fsck the number of in-ordered reads is
quite high. This count reduces considerably when using the modified fsck. The
actual counts are as follows:

        File System used (in GB)    Original    Modified
        10                          29          5
        20                          47          26
        30                          89          40
        40                          107         49
        50                          136         61

       Table 4: Number of in-ordered reads

   6.4 Finding seek per number of IOs

        To get a rough estimate of how a 16x capacity increase, a 5x bandwidth
increase, and almost no change in seek time will affect fsck time in 2013, we
ran fsck on the /dev/md0 partition (ext3 formatted) of a desktop machine (see
Table 2 for details on the disks used). We measured the elapsed time and CPU
time using the time command, and the number of individual I/O operations and
the total data read using iostat. Using this data and projected changes in disk
hardware, we made a rough estimate of the time needed to complete a file system
check on a moderately sized desktop (RAID 0) file system in 2013.

        First, we estimated how the total time divides up into time spent using
the CPU, reading data off the disk, and head travel between blocks (seek time
plus rotational latency). To check an 80 GB file system with 60 GB of data:

        - The total elapsed time of the original fsck is 527 seconds, 21
          seconds of which are spent in CPU time. That leaves 506 seconds in
          I/O.
        - The total elapsed time of the modified fsck is 398 seconds, 16
          seconds of which are spent in CPU time. That leaves 382 seconds in
          I/O.
        - We measured 1.5 GB of data read. We estimated the amount of time to
          read 1.5 GB of data off the disk by using dd to transfer that amount
          of data from the partition into /dev/null, which took 37 seconds at
          the optimal read size.
        - The remaining time, 469 seconds (for the original fsck) and 345
          seconds (for the modified fsck), we assume is spent seeking between
          tracks and waiting for the data to rotate under the head.
        - We measured 233,440 separate I/O requests.
        - The average seek time for this disk is 9 ms, and the average
          rotational latency is 6 ms. We estimate that the original fsck
          required about 32000 seeks (about one seek per every 35-38 I/Os)
          while the modified fsck required about 22000 seeks (about one seek
          per every 58-60 I/Os).
Steps to find IOs/seek:

1. Elapsed time = CPU time + time to read data off the disk + head travel
   between the blocks (seek time + rotational latency).
2. Calculate the total elapsed time for e2fsck.
3. Subtract the CPU time from it to get the input/output time.
   Note: CPU time = user time + system time (time command used).
4. Calculate the time required to read a certain amount of metadata
   (approximately equal to the current metadata for the test).
   Note: use a dd operation to calculate this.
5. Subtract the above time from the I/O time to get the time spent in head
   travel.
6. Divide this time by the disk access time to get the number of seeks.
   Note: access time = seek time + rotational latency.
7. Calculate the number of I/O requests for the test.
8. Divide this by the number of seeks calculated in step 6 to get the number
   of I/Os per seek.
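
        A worked instance of steps 1-6, using the numbers measured above for
the original fsck run, is sketched below. Step 4's dd read corresponds to an
invocation along the lines of "dd if=/dev/md0 of=/dev/null bs=1M count=1536"
for the 1.5 GB measured (the exact invocation is an assumption):

    /* Worked example of steps 1-6 with the original-fsck numbers from
     * Section 6.4: 527 s elapsed, 21 s CPU, 37 s to read 1.5 GB via dd,
     * and 9 ms seek + 6 ms rotational latency = 15 ms access time. */
    #include <stdio.h>

    int main(void)
    {
        double elapsed     = 527.0;          /* step 2: time(1) output    */
        double cpu         = 21.0;           /* user + system time        */
        double read_time   = 37.0;           /* step 4: dd over 1.5 GB    */
        double access_time = 0.009 + 0.006;  /* seek + rotational latency */

        double io_time     = elapsed - cpu;             /* step 3: 506 s  */
        double head_travel = io_time - read_time;       /* step 5: 469 s  */
        double n_seeks     = head_travel / access_time; /* step 6         */

        printf("I/O time    : %.0f s\n", io_time);
        printf("head travel : %.0f s\n", head_travel);
        printf("seeks       : %.0f (about 32000)\n", n_seeks);
        /* Steps 7-8 would divide the iostat request count for the run
         * by n_seeks to obtain the average number of I/Os per seek. */
        return 0;
    }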

7. Conclusion

     This paper enunciates the design and implementation of a multithreaded
filesystem checker (fsck), an improvement over the current single-threaded
version. It also describes the extensions implemented in the current fsck to
enable clustering of metadata, which further improves performance. The sample
tests performed using FSCK-SX have shown its capability to achieve nearly a
30% enhancement over current performance. FSCK-SX thus provides a framework
for an optimized filesystem checker on the ext3 filesystem, a concept which
can be extended to other file systems.

8. References

[1] Valerie Henson, Open Source Technology Centre, Intel Corporation.
Repair-Driven File System Design.

[2] Val Henson, Zach Brown, Theodore Ts'o, Arjan van de Ven. Reducing fsck
time for ext2 file systems. In Ottawa Linux Symposium 2006, 2006.

[3] Val Henson, Arjan van de Ven, Amit Gud, and Zach Brown. Chunkfs: Using
divide-and-conquer to improve file system reliability and repair. In Hot
Topics in System Dependability, 2006.

[4] Design and Implementation of the Second Extended File System.
http://e2fsprogs.sourceforge.net/ext2intro.html

[5] T. J. Kowalski and Marshall K. McKusick. Fsck - the UNIX file system
check program. Technical report, Bell Laboratories, 1978.
