Tuning OLTP RAID IO

   Paul Tuckfield
I'm talking OLTP, and just IO
• You need to make sure you've done
  other things first, and have minimized
  your IO load, and arrived at the point
  where you must improve IO performance
• If you bought a multidisk controller, make sure
  it's actually using multiple disks concurrently.
4 ways to make OLTP IO faster
0.) Do not do IO for the query
1.) Do IO only for the rows the query needs
2.) If you must do lots of IO, do it sequentially (read
ahead), but only in the DB, not in the fs or raid
3.) Make sure you're really doing concurrent IO to
multiple spindles (many things can prevent this)
digression: Sometimes it's faster to stream a whole table from disk and ignore most of
the blocks, even though an index could have fetched only the blocks you want. Where is
the tradeoff point? In OLTP, almost never, because the full scan flushes the caches.
4 ways to make OLTP IO faster

0.) Do not do IO <- Tune your queries first
1.) Do as few IOs as possible <- Tune the DB cache first
2.) If you must do IO, do it sequentially <- not an OLTP workload;
make sure the filesystem and raid aren't trying to “help” you by
doing readahead
3.) Make sure you're really doing concurrent IO <- know when
you're serializing vs. getting concurrent IO, then tune the IO stack



digression: flash drives. how will they change this?
Generic rules for OLTP IO tuning
• Never cache anything below the DB
• Cache only writes in the raid controller (the sole
  exception to rule 1, assuming battery backed raid
  cache)
• Never do readahead anywhere but in the DB
  (similar to not caching outside the db, but then: is the
  DB doing the right thing?)
• Make sure your stripe/chunk size is much larger
  than the db blocksize
• Make sure you're actually doing concurrent IO
Only DB should cache reads
       In a reasonable InnoDB setup, your db cache will be smaller than
       your disk-based dataset, the filesystem cache (if you insisted on
       having one against my advice) should be smaller, and the raid cache will
       be smaller still (ram is cheaper than a disk array). Imagine the db has
       read blocks 0 through 11; this is what the caches look like. Each cache
       holds only its N most recent blocks.


Db cache            3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11

FS cache                      7 | 8 | 9 | 10 | 11


             RAID cache              10 | 11


       0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 11| 12 | 13 | 14 | 15
Only DB should cache reads
    Suppose a read of any block: if it's in any of these caches, it will also be in the
    DB cache, because the db cache is the largest. Therefore the read caches of
    the filesystem and raid controller will never be used usefully on a cache hit.
    Suppose it's an InnoDB cache miss, on the other hand: it won't be in the fs or raid
    cache either, since they're smaller and can't have a longer LRU list than
    InnoDB's. Therefore, they're useless. (A small sketch of this follows the diagram.)


Db cache               3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11

FS cache                        7 | 8 | 9 | 10 | 11


                RAID cache             10 | 11


          0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 11| 12 | 13 | 14 | 15
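To make that concrete, here is a minimal Python sketch of the picture in the diagram
above. The cache sizes 9, 5 and 2 are just the ones drawn here, and a plain LRU per
layer is my simplification, not InnoDB's or any controller's real eviction code:

    from collections import deque

    # Each layer keeps only its N most recently seen blocks (plain LRU).
    db, fs, raid = deque(maxlen=9), deque(maxlen=5), deque(maxlen=2)

    for block in range(12):            # the DB reads blocks 0..11; all miss on cold caches
        for cache in (raid, fs, db):   # each miss passes up through every layer
            cache.append(block)

    print(list(db))    # [3, 4, 5, 6, 7, 8, 9, 10, 11]
    print(list(fs))    # [7, 8, 9, 10, 11]
    print(list(raid))  # [10, 11]

Every block still held by fs or raid is also held by db, so a read that misses the DB
cache cannot be answered by the smaller caches below it.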
Only the DB should readahead
• The db knows, or at least has far better heuristic
  data to know, when it needs to read ahead (InnoDB's
  method is questionable IMO)
• The filesystem and raid are making much weaker guesses;
  if they do readahead, it's more or less hard coded,
  which is not good. The nominal case should be a single block.
• When they do readahead, they're essentially doing
  even more read caching, which we've established
  is useless or even bad in these layers.
• Check “avgrq-sz” in iostat to see if some upper
  layer is doing more than a 1-block (16k in the case of InnoDB)
  read (a sketch of this check follows)
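One way to automate that check is a small filter over iostat output. This is a sketch,
not a finished tool: it assumes the extended format of iostat -xkd with an avgrq-sz
column (reported in 512-byte sectors), and since column layout varies between sysstat
versions it locates the column by name.

    import sys

    header = None
    for line in sys.stdin:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):       # header row: remember where each column is
            header = fields
            continue
        if header and "avgrq-sz" in header and len(fields) == len(header):
            size_sectors = float(fields[header.index("avgrq-sz")])
            if size_sectors > 32:                # 32 sectors * 512 bytes = 16k, one InnoDB page
                print("%s: avgrq-sz %.1f sectors (> 16k: something below the DB is "
                      "merging or reading ahead)" % (fields[0], size_sectors))

Pipe live output into it, e.g. iostat -xkd 5 | python check_avgrq.py (the script name is mine).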
Only cache writes in controller
     Suppose that after reading 0 through 11, the db modifies blocks 10 and 11.
     The blocks are written to the FS cache and then to the raid cache. Until
     acknowledgments come back, they're “dirty pages”, signified in red.




Db cache           3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11

FS cache                    7 | 8 | 9 | 10 | 11


             RAID cache           10 | 11


      0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 |11| 12 | 13 | 14 | 15
Only cache writes in controller
     When the raid puts the blocks in RAM, it tells the FS that the block is
     written; likewise the fs tells the db it's written. Both layers then
     consider the page to be “clean”, signified by turning them black here.




Db cache            3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11

FS cache                    7 | 8 | 9 | 10 | 11

             RAID cache            10 | 11       (write acks flow back up;
                                                  the RAID acks before actually
                                                  writing to disks)


      0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 |11| 12 | 13 | 14 | 15
Only cache writes in controller
  The database can burst lots of writes to the raid cache before they ever get
  committed to disk, which is good: the raid offloads all the waiting and
  housekeeping of writing to persistent storage. Now suppose a cache miss
  occurs requesting block 14. It can't be put into cache until block 10 is
  written and the LRU is “pushed down”.


Db cache             3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11

FS cache                     7 | 8 | 9 | 10 | 11


               RAID cache           10 | 11


disk 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10| 11| 12 | 13 | 14 | 15
Only cache writes in controller
     Block 10 is retired, at the cost of 1 IO, before 14 is ever accessed.



Db cache           4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 14

FS cache                  8 | 9 | 10 | 11 |14

              RAID cache                 11 |


disk 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 11| 12 | 13 | 14 | 15
Only cache writes in controller
  Now 14 can be put into the raid cache, at a cost of 1 more IO,
             before being sent on to the host.



Db cache           4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 14

FS cache                  8 | 9 | 10 | 11 |14

              RAID cache                 11 | 14


disk 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 11| 12 | 13 | 14 | 15
This is the ideal I seek: no filesystem cache at all, and only caching
   writes on the raid. In the same scenario, you can see how the read of
   block 2 will not serialize behind the write of block 10.



Db cache           4 | 5 | 6 | 7 | 8 | 9 | 10 | 11|

FS cache


               RAID cache             10 | 11


disk 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10| 11| 12 | 13 | 14 | 15
Only cache writes in controller
A real RAID cache is more complex than this example, sometimes involving
high water marks or some scheme for splitting the cache between dirty and
clean pages. But the principle is the same: after a burst of writes, the raid
will finally decide that it shouldn't allow more writes or reads until it can evict
some dirty pages. I like to think of this as “borrowing” from future bandwidth
when doing a burst of writes.

Now, we've already established that the read cache is useless, because any
cache hit will happen in the upper, much larger layers. If dirty pages are at
the tail of an LRU cache, that means any reads will also have to wait for a
dirty page to be written before proceeding (filling a cache buffer in the raid
controller before passing up to the host). Therefore, it's great to cache writes
as long as you do not cache reads in the controller.

Some controllers (EMC, I think, for example) may partition dirty and clean
pages, and also partition by spindles, so that reads still don't serialize behind
writes and no one spindle “borrows” too much bandwidth, leaving a cache
full of dirty pages that must be retired to only one spindle, for example. But
again, we've established that the raid read cache is useless in a nominal setup,
because it's always smaller than the host-side cache, so you may as well just
give all the space to writes.
Only cache writes in controller
• Burst writes “borrow” bandwidth by writing to the raid
  cache (see the toy model below)
• At some point, new IOs cannot continue, because the
  controller must “pay back” IO by freeing up cache
  slots.
• If reads don't go through this cache, they never
  serialize here. Some controllers partition, which is
  good enough.
• Various caching algorithms vary this somewhat,
  but in the end they are at best no better than not
  caching reads, and at worst they hurt overall performance.
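Here is a toy model of that “borrow now, pay back later” behavior. It is entirely my own
simplification, not any controller's real algorithm: a write-back cache with a handful of
slots acks writes instantly until it fills, after which every new IO first waits for one
dirty slot to flush.

    CACHE_SLOTS = 4      # assumed cache size, for illustration only
    FLUSH_MS = 5.0       # assumed cost to retire one dirty page to a spindle

    def burst_wait(num_writes):
        """Total time the host waits during a burst of cached writes."""
        dirty, waited_ms = 0, 0.0
        for _ in range(num_writes):
            if dirty == CACHE_SLOTS:    # cache full: pay back one flush before accepting more
                waited_ms += FLUSH_MS
                dirty -= 1
            dirty += 1                  # the write itself is acked straight into cache
        return waited_ms

    print(burst_wait(4))    # 0.0  -> a short burst is absorbed entirely: "borrowed" bandwidth
    print(burst_wait(20))   # 80.0 -> a long burst ends up paying for the flushes after all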
Make stripe size >> blocksize
• Suppose you read a 16k block from a
  striped (or raid10) array with a small
  chunk size. You “engage” every spindle
  to service one IO.




           16k stripe / 2k chunk
Make stripe size >> blocksize
•   Now suppose 3 reads are issued by the host
•   Seek /rotate on all drives (5 ms)
•   Deliver 2k in parallel from all drives (<1ms)
•   Seek /rotate on all drives (5 ms)
•   Deliver 2k in parallel from all drives (<1ms)
•   Seek /rotate on all drives (5 ms)
•   Deliver 2k in parallel from all drives (<1ms)
•   THIS IS SERIALIZATION THAT SHOULD BE PARALLEL




                 16k stripe / 2k chunk
Make stripe size >> blocksize
•   Issue 3 concurrent IOs: seek/rotate on 3 random drives (5 ms)
•   Deliver 3x16k in parallel from 3 different drives (<1ms): each block “fits”
    on a single spindle (a sketch of this geometry follows the diagram)
•   Note that smaller chunk sizes also increase the likelihood of
    misaligned reads engaging 2 spindles instead of 1
•   A write of a given block, of course, engages 2x spindles in a mirrored
    config




                      1M stripe / 128k chunk
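The geometry argument in a few lines of Python. This is a sketch: 8 spindles and the two
chunk sizes are just the numbers used in these slides, and a real array adds mirroring
and alignment details.

    SPINDLES = 8
    BLOCK = 16 * 1024                         # one InnoDB page

    def spindles_engaged(offset, chunk):
        """How many spindles a single BLOCK-sized read touches in a plain stripe."""
        first = offset // chunk
        last = (offset + BLOCK - 1) // chunk
        return len({c % SPINDLES for c in range(first, last + 1)})

    print(spindles_engaged(0, 2 * 1024))              # 8 -> a 16k read ties up every spindle
    print(spindles_engaged(0, 128 * 1024))            # 1 -> one spindle, the others stay free
    print(spindles_engaged(120 * 1024, 128 * 1024))   # 2 -> a misaligned read straddles a chunk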
Generic rules for OLTP IO tuning, with “answers”
• Never cache anything below the DB
  – innodb_flush_method=O_DIRECT
    eliminates the FS cache.
  – If you must use, or can't turn off, the fs cache,
    look into the vm.swappiness parameter to
    prevent paging (a config sketch follows)
  – I've seen serialization happen at this layer
    for reasons I don't understand; I suspect it's
    related to O_DIRECT, but worked around it by
    mirroring in hardware and striping in md
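A minimal config sketch of those two knobs; the swappiness value is my assumption, since
the slide only says to look into the parameter:

    # my.cnf: InnoDB opens its data files with O_DIRECT, bypassing the FS cache
    [mysqld]
    innodb_flush_method = O_DIRECT

    # /etc/sysctl.conf: if an FS cache remains, discourage the kernel from
    # swapping the database out in favor of keeping page cache
    vm.swappiness = 0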
Generic rules for OLTP IO tuning, with “answers”
• Cache only writes in the raid controller,
  no readahead:
  – adaptec (currently our controller for live
    dbs):
     • /usr/local/sbin/arcconf setcache 1
       logicaldrive 1 roff
     • Implements my ideal of caching writes but not
       reads, and not doing readahead
Generic rules for OLTP IO tuning, with “answers”
• Cache only writes in the raid controller,
  no readahead
  – MegaRAID (our controller through
    2006):
     • write Policy = WRBACK
     • Read Policy = NORMAL
     • Cache Policy = DirectIO
  – These settings were what I had before we
    switched to Adaptec. I never decided
    whether it's best to allow read/write caching
    without readahead, or to turn both off with DirectIO.
Generic rules for OLTP IO tuning, with “answers”
• Cache only writes in the raid controller,
  no readahead
  – 3ware (I have a 9500S, never put
    a real workload on one)
     • no apparent ability to shut off readahead; not sure
       it's doing readahead
     • can shut off write caching, but not
       read (the opposite of what I want)
     • my memory is vague on this
       controller; I used it for a
       fileserver a while back, and only did
       token testing on it with a database in
       mind
My ideal
 No FS cache at all, and only caching writes in the
controller. With this setup in place, it comes time to
look at iostat. (iostat sees this layer: the block device
the RAID presents to Linux.)

Db cache             0 | 1 | 2 | 6 | 7 | 8 | 9 | 10 | 11

FS cache


                     RAID cache            10 | 11


 disk 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10| 11| 12 | 13 | 14 | 15
Detecting serialization on multidisk arrays
•   I've observed well-tuned disk arrays serializing IO for some reason, when
    they should be parallelizing (doing multiple IOs to multiple disks
    concurrently), and concluded it's somewhere in the filesystem/driver
    layers of Linux
•   Diagnosis: if you're > 80% busy in iostat and have a multidisk controller,
    look CLOSELY at the other numbers
     – Is (reads/sec + writes/sec) times the average service time < 1? Then you
       are serializing IO, not getting the full bandwidth of your spindles
     – If you're near 100% busy on a multidisk controller, you should be
       seeing the throughput of multiple disks.
Concurrent vs serialized
     In this example, suppose you've taken 4 spindles in a disk array and
     presented them as a single drive to Linux by doing raid0. Linux “sees” one
     drive, /dev/sda, but should still be willing to issue concurrent IOs. But is it
     really issuing concurrent IO, or is it serializing them and issuing 1 at a time?

[Timeline figure: 4 disks (d1..d4) between t0 and t1, drawn twice]

Top timeline, proper concurrency: 2 IOs compete for disk d1, so there is some
true serialization there. These 4 IOs would show as < 50% busy (they use less
than half the time between t0 and t1).

Bottom timeline, serialized: the same 4 IOs, if serialized, would show as 80%
busy on the array. Truly random IOs should always see some overlap; these IOs
never overlap. If you detect this (with iostat), it surely means some software
layer is serializing, not that your raid is slow.
Concurrent vs serialized: both 80% busy
     When you see a multidisk array at 80% busy, you have to make sure the
     IO/sec and response times crosscheck with the % busy figure.

[Timeline figure: the same 4-disk array drawn twice, both at 80% busy]

Top array: 80% busy, 9 IOs at 200 ms.
Bottom array: 80% busy, 4 IOs at 200 ms.
Formula for detecting serialization: (r/s + w/s) * svctm < 1 with %busy near 100
• Remember that svctm is in milliseconds, so divide by 1000 (a sketch of this check follows)
   – ex 1: (r/s + w/s) = 9, so
      • (9) * .200 = 1.8
      • yes, concurrent IOs on the array
   – ex 2: (r/s + w/s) = 4, so
      • (4) * .200 = .8
      • NO, the array is not doing concurrent IO
   – BOTH at 80% busy, but one is doing double the IO!
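The same rule of thumb as a couple of lines of Python, using the numbers from the two
examples above (svctm comes straight from iostat and is in milliseconds; splitting the IO
rate into reads and writes is arbitrary here, only the sum matters):

    def doing_concurrent_io(r_s, w_s, svctm_ms):
        """True if (r/s + w/s) * svctm suggests overlapping IOs on the array."""
        return (r_s + w_s) * (svctm_ms / 1000.0) > 1

    print(doing_concurrent_io(9, 0, 200))   # True:  9 * 0.200 = 1.8 -> concurrent
    print(doing_concurrent_io(4, 0, 200))   # False: 4 * 0.200 = 0.8 -> serialized at 80% busy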
Example iostat -xkd exhibiting serialization
             This is real-world iostat output on a system exhibiting some kind of
             “false” serialization in the filesystem/kernel layer. I reproduced it
             using rsync, so don't think this tells you anything about youtube scale.
             (Columns omitted for readability.)


Device:       ...       r/s   w/s             ...        avgqu-sz     await      svctm      %util
sda           ...       0.00 11.91            ...            0.37     31.13       1.09       1.30
sdb           ...    2663.26 12.11            ...          112.86     42.08       0.37      99.96




           r/s, w/s: reads and writes per second
           await: time waited in the kernel before the IO is issued to the device
           svctm: time actually taken by the device, after the IO was finally issued to it
Example iostat -xkd exhibiting serialization
 Is (r/s + w/s) * svctm > 1?  (2663 + 12) * .00037 = .98, so no:
 this is serializing “above” the raid device. avgqu-sz further suggests
 it's serializing in the kernel, not the application (InnoDB, or rsync in
 this case). A large spread between await and svctm means things wait in
 queue in the kernel for a long time: higher layers are issuing IOs
 concurrently, but the array can't soak them up fast enough to keep the
 queue low.

Device:        ...        r/s   w/s                 ...        avgqu-sz        await      svctm        %util
sda            ...        0.00 11.91                ...            0.37        31.13       1.09         1.30
sdb            ...     2663.26 12.11                ...          112.86        42.08       0.37        99.96




           r/s, w/s: reads and writes per second
           avgrq-sz: if you see this much bigger than 32 sectors (which == 16k,
           the InnoDB blocksize), the larger numbers point to readahead somewhere
           svctm/await: most disks should stay below 10 ms. This output is weird
           because it's not from a db, it's from rsync.
Working around serialization on a multidisk array
I'm not sure why, but when I presented mirrored drives to Linux and then striped them in md
    (software raid), the serialization went away. I believe this is related to O_SYNC,
    but it may be just filesystem or scsi queueing side effects. (A sketch with mdadm
    follows the diagram.)




      linux md
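
A sketch of that layout with mdadm. The device names and chunk size are assumptions for
illustration: here the controller's two hardware RAID1 mirrors show up as /dev/sda and
/dev/sdb.

    # Stripe (raid0) across the two hardware mirrors, with a chunk (in KB) much
    # larger than the 16k InnoDB page, per the stripe-size rule earlier in the deck.
    mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=256 /dev/sda /dev/sdb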
