Distro Recipes 2013 : My ${favorite_linux_distro} is slow!

My ${favorite} Linux Distribution is slow !

Credit : fras1977@flickr

Distro Recipes 4th April 2013 @Paris

Performance does matter

● Users expect more performance
● They do have perfect hardware
● They installed the latest OS release
● So it should be faster than ever! Shouldn't it?
● But we still get those imprecise reports...

« Hey ! My Linux Distro is Slow ! »
« The latest OS reduces the performance ! »

About this talk

● What to expect ?
– Tricks to prove the distro is not always the bad guy
– A compilation of real debugging sessions

● What not to expect ?
– Having one magic answer about perf.

● Who are you ?

Tracking the beast

● Slowdowns come from various sources
– CPU
– Storage
– Interrupts
– Memory
– Network (not included in this presentation)
– Applications (not included)

CPU load

● Estimating the load of the CPU is pretty easy
● Use « top » sorted by CPU usage
– Don't mix it up with loadavg!
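To make the difference with loadavg concrete, here is a minimal sketch of how a CPU busy percentage can be derived from two samples of the aggregate "cpu" line in /proc/stat on Linux. The function names are hypothetical, and counting iowait as idle time is an assumption of this sketch, not something stated in the talk.

```python
import time


def parse_cpu_line(line):
    """Return (busy, total) jiffies from the aggregate "cpu" line of /proc/stat."""
    fields = line.split()
    # Columns: user nice system idle iowait irq softirq steal
    # (guest time is already accounted in user, so it is left out).
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + values[4]  # assumption: iowait counts as idle
    total = sum(values)
    return total - idle, total


def busy_percent(before, after):
    """Percentage of non-idle time between two (busy, total) samples."""
    busy = after[0] - before[0]
    total = after[1] - before[1]
    return 100.0 * busy / total if total > 0 else 0.0


def sample_busy_percent(interval=1.0, path="/proc/stat"):
    """Linux only: measure CPU busy time over `interval` seconds."""
    with open(path) as f:
        before = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.readline())
    return busy_percent(before, after)
```

Calling `sample_busy_percent()` on a Linux box gives a figure comparable to the CPU line of « top », whereas loadavg counts tasks and is not a percentage.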

Weird CPU issues

● Temperature
– Internal throttling to avoid overheating
– ~110/120 °C on Intel CPUs
– Monitoring via coretemp & acpi
« CPU1: Core temperature above threshold,
cpu clock throttled (total events = 12841) »
– Generates Machine Check Exceptions (MCE)
– As a result, CPU performance is reduced
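A small sketch for spotting this in a kernel log: it extracts the throttling counter from messages shaped like the one quoted above. The regular expression is built only from that quoted message, so other kernel versions may word it differently; the function name is hypothetical.

```python
import re

# Pattern modelled on the message quoted in the slide; the wording on
# other kernel versions may differ (assumption).
THROTTLE_RE = re.compile(
    r"CPU(\d+):\s+(\w+) temperature above threshold,\s+"
    r"cpu clock throttled \(total events = (\d+)\)"
)


def throttle_events(log_text):
    """Map CPU number -> highest 'total events' count found in the log."""
    events = {}
    for cpu, _scope, count in THROTTLE_RE.findall(log_text):
        cpu, count = int(cpu), int(count)
        events[cpu] = max(count, events.get(cpu, 0))
    return events
```

Feeding it the output of `dmesg` is one possible use; a counter that keeps growing between two runs means the CPU is still being throttled.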

Storage Load

● Massive IOs can seriously slow down a system
– Depending on the storage device ( HDD vs SSD)
– Depending on the IO profile (sequential vs random)
– « vmstat » is useful to track this behavior

bi = blocks in
bo = blocks out
wa = waiting IO
si = swap in
so = swap out

Someone reads a lot !
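The columns listed above can also be read programmatically. This is a minimal sketch that parses the text printed by « vmstat » into dictionaries keyed by column name and flags IO-bound samples. The 20% threshold and the function names are assumptions made for illustration.

```python
def parse_vmstat(text):
    """Parse vmstat output into a list of {column: value} dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if "bi" in tokens and "wa" in tokens:
            header = tokens  # the line naming the columns
        elif header and len(tokens) == len(header) and all(t.isdigit() for t in tokens):
            rows.append(dict(zip(header, map(int, tokens))))
    return rows


def io_bound_samples(rows, wa_threshold=20):
    """Keep samples where the CPU spent at least wa_threshold % waiting on IO."""
    return [row for row in rows if row["wa"] >= wa_threshold]
```

The text of `vmstat 1 10` can be passed to `parse_vmstat`; high `bi` together with high `wa` points at a heavy reader, while non-zero `si`/`so` points at swapping.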

Storage Load

Someone tries to read a lot!
(3 threads read 4K random)

● The CPU waits on the storage device (~30% wa)
● HDD + 3 threads @ 4K random generates a massive device load
● During this load, my system was unusable
● A desktop search, rsync, tar, ... can generate such a load
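A load of this shape (3 threads, 4K random reads) can be approximated with fio. The sketch below only builds the command line; the target path, file size and runtime are hypothetical values chosen for illustration, not the settings used for the measurement above.

```python
import shlex


def fio_randread_cmd(target, jobs=3, block_size="4k", runtime_s=60):
    """Build an fio command line for a read-only random-read test."""
    return [
        "fio",
        "--name=randread",
        "--filename=" + target,    # hypothetical test file
        "--size=1g",               # assumption: 1 GiB test file
        "--rw=randread",           # random reads only, nothing is written to existing data
        "--bs=" + block_size,
        "--numjobs=" + str(jobs),  # 3 concurrent readers, as on the slide
        "--direct=1",              # bypass the page cache
        "--time_based",
        "--runtime=" + str(runtime_s),
        "--group_reporting",
    ]


print(shlex.join(fio_randread_cmd("/tmp/fio-testfile")))
```

Running the printed command while watching « vmstat 1 » in another terminal should show `bi` and `wa` climbing on an HDD.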

Storage Load

● A broken/slow storage device can load the system

● HDD: bad-sector reallocation is invisible but causes lag
● SATA disks try several times to recover sectors
● No other IOs will be accepted during this process
● This kills RAID arrays
● Enterprise-class SATA disks reallocate immediately

● Use SMART to count {broken|pending|reallocated} sectors
● %wa in top or vmstat should be high in such a case

Storage Load
● « smartctl -a /dev/sda » of a dying HDD
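As a rough sketch, the sector counters can be pulled out of the attribute table that smartctl prints. The attribute names below are the ones commonly reported by smartctl for ATA disks; they vary by vendor, so treat the list and the function name as assumptions.

```python
# Attribute names commonly seen in smartctl output (vendor dependent, assumption).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def sector_counters(smartctl_output):
    """Return {attribute_name: raw_value} for the watched SMART attributes."""
    counters = {}
    for line in smartctl_output.splitlines():
        parts = line.split()
        # Table rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(parts) >= 10 and parts[0].isdigit() and parts[1] in WATCHED:
            raw = parts[9]
            if raw.isdigit():
                counters[parts[1]] = int(raw)
    return counters
```

Non-zero raw values that keep growing between two runs are a strong hint that the disk is spending time recovering sectors.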

Storage Load

● SSDs: far from a perfect device
● Performance varies with the firmware implementation
● SLC front cache before reaching the MLC storage
– Out-of-cache effect once the SLC cache is full
– 200+MB/s on SLC
– 5MB/s on MLC in the worst case
– After a while, global SSD performance is limited to 5MB/sec
– Behavior not visible with {simple|short} workloads

– %wa in top or vmstat should increase in such a case
– Can be reproduced by using fio
http://git.kernel.dk/?p=fio.git

SSD IO Path

[Diagram: SATA IOs (6Gb/sec) → IO controller → SLC cache → MLC cells; rates labelled in the diagram: 960 Mb/sec and 40Mb/sec]

Weird Storage Issues

● Temperature
– On HDDs, thermal recalibration can occur too often for the disk to maintain a steady level of service
– Media-class disks are less subject to this effect

● Vibrations
– RAID arrays contain several HDDs spinning constantly
– These individual vibrations keep the heads from staying properly aligned, which leads to head recalibrations
– That can totally prevent a RAID array from delivering IOs

IRQ Storms

● Inside an array of 1200+ identical computers
● Some boot very, very slowly and trigger software watchdogs
● /proc/interrupts reports an IRQ storm (66000 per sec) on interrupt 19
● The CPU is permanently interrupted by IRQs
● The AHCI controller floods because the HDD doesn't answer ATA_IDENTIFY requests (seen by extracting the HDD)
● The AHCI driver fails at probing, so interrupt 19 only reports the USB device
● Some hardware failures can lead to load issues
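A per-second rate like the one quoted above comes from comparing two readings of /proc/interrupts. The following is a minimal sketch of that comparison; the function names are hypothetical, and the parser assumes the usual layout (a header row of CPU names, then one row per interrupt).

```python
def parse_interrupts(text):
    """Return {irq_label: total count summed over all CPUs}."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        counts = []
        for token in rest.split()[:ncpu]:
            if not token.isdigit():
                break  # reached the controller / device description
            counts.append(int(token))
        totals[label.strip()] = sum(counts)
    return totals


def irq_rates(before, after, interval_s):
    """Interrupts per second for each IRQ between two parsed snapshots."""
    return {
        irq: (after[irq] - before.get(irq, 0)) / interval_s
        for irq in after
    }
```

Reading /proc/interrupts twice, one second apart, and sorting `irq_rates(...)` by value puts a storming interrupt at the top of the list.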

Memory Issues

● Two identical servers that don't perform the same
– One is much slower than the other
● Same server brand / model
● Same vendor
● Same hardware setup
● But they perform really differently...
● What the hell is my {application|os} doing wrong here?

Memory Issues

● Memory banks were not populated with the same hardware
● Some were DDR3 with CAS Latency = 9
● Some were DDR3 with CAS Latency = 11
● As a result, memory accesses were slower on one server
● This was detected at runtime under Linux with the DDR3 timing tool from Cyring (http://code.cyring.fr/FTS/?PATH=Source/C/DDR3_Timings/0.2/timings.c)
● Hardware setups were supposed to be the same!
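To get a feel for the gap between CL9 and CL11, CAS latency can be converted from clock cycles to nanoseconds. DDR memory transfers twice per clock, so one cycle lasts 2000 / data_rate ns when the data rate is given in MT/s. The talk does not state the module speed; DDR3-1600 below is an assumption used only for illustration.

```python
def cas_latency_ns(cas_cycles, data_rate_mts):
    """Convert a CAS latency in clock cycles to nanoseconds.

    data_rate_mts is the DDR data rate in MT/s (e.g. 1600 for DDR3-1600);
    the memory clock runs at half that rate, hence the factor 2000.
    """
    return cas_cycles * 2000 / data_rate_mts


# Assumption: both sets of modules run at DDR3-1600 (not stated in the talk).
for cl in (9, 11):
    print(f"CL{cl}: {cas_latency_ns(cl, 1600)} ns")
```

Under that assumption, the CL11 modules add 2.5 ns to every column access compared with the CL9 ones.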

Dear Loadavg,

● You are complicated to understand
● You don't help tracking the source of the load
● You can be a liar if some kernel code doesn't update you

● But you provide an indicator of the global load
– 1.0 means 100% of the resources

● I'll keep you as a raw indicator to start my investigations
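One way to use loadavg as a raw indicator is to relate it to the number of CPUs, since on a multi-CPU machine a load of 1.0 does not mean the whole machine is busy. This is a minimal sketch with hypothetical function names. Keep in mind that on Linux the figure also counts tasks blocked in uninterruptible IO, so a high value does not by itself say whether CPU or storage is the bottleneck.

```python
import os


def load_per_cpu(load, cpus):
    """Scale a load average by the number of CPUs."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return load / cpus


def current_load_per_cpu():
    """1-minute load average divided by the CPU count (Unix only)."""
    one_min, _five_min, _fifteen_min = os.getloadavg()
    return load_per_cpu(one_min, os.cpu_count() or 1)
```

A result near or above 1.0 is a reason to look closer with « top » and « vmstat », not a diagnosis.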

Thanks !

● Email : erwanliasr1@gmail.com

● IRC : erwan_taf @ {freenode | oftc }
