LinuxCon 2011: OpenVZ and Linux Kernel Testing


I'm curious. For the past few months, people@openvz.org have discovered (and fixed) an ongoing stream of obscure but serious and quite long-standing bugs. How are you discovering these bugs? Andrew added later: hm, OK, I was visualizing some mysterious Russian bugfinding machine or something. Don't stop ;)

1
Andrew Vagin <avagin@parallels.com>
Developer, Linux Kernel team
OpenVZ and Linux Kernel Testing

2
Agenda
● Linux containers and OpenVZ
● Ideal test lab
● Testing techniques
● Performance testing
● Anecdotes

3
Andrew Morton:
I'm curious. For the past few months, people@openvz.org have discovered (and fixed) an ongoing stream of obscure but serious and quite long-standing bugs. How are you discovering these bugs?
Andrew added later:
hm, OK, I was visualizing some mysterious Russian bugfinding machine or something. Don't stop ;)
David Miller:
This issue has existed since the very creation of the netlink code :-)

4
Linux Containers (LXC)
Many isolated environments on top of a single kernel
● Namespaces
● Resource accounting
OpenVZ Containers
● Better resource accounting
● Checkpointing and live migration
● Extra features: CPU limits, NFS inside CTs, etc.

5
What makes a good test lab?
● Fully automated system with deployment service
● A web interface for test scheduling
● Standard test sets (“combo #3, make it large”)
● A web interface for test results (comparisons, graphs, logs)
● Integration with a bug tracking system
● Net or serial console to collect kernel oopses
● KVM, power switch, other goodies

6
How do we find bugs in the mainstream kernel?
Containers help us find more bugs
● Independent life cycles
● Precise resource accounting
Containers allow us to
● Test initialization/finalization of kernel subsystems
● Test error paths
● Catch more leaks than regular testing does
● Catch more race conditions by means of stress testing

7
Start/stop test
● Massive parallel start/stop and suspend/resume
● Random resource parameters
Helps to catch:
● Race conditions
● Bugs in error paths
● Memory leaks

8
What makes a good performance test?
● Effective load:
  ● Atomic (UnixBench)
  ● Complex (LAMP, SPEC-JBB, vConsolidate)
● Sane test environment (no random cron jobs etc.)
● Automation (minimize human interaction)
● Reproducible results, minimize variability
● Understand test results, even good ones

12
Density testing
● High density is an important feature of OpenVZ (vs. VMs)
● The test measures response time on a number of CTs, increasing the number of CTs until the response time is bad
● It's not a stress test
● Produces a big resource overcommit

13
Other useful tests
● The week load test replays real httpd logs in real containers
● Feature tests: isolation, CPU scheduler, checkpointing, network virtualization, second-level quota, etc.
● Third-party tests: LTP, Connectathon, vSpecJBB, vConsolidate, UnixBench, sysbench, DVD-store, Netperf

14
Real life stories

15
(1) How a Russian bug finding machine works
● QA found a leak of 78 bytes of kernel memory
● The developer was unable to reproduce the bug
● He found that this was a leak of a 'struct user' object
● He audited the kernel code which references this object
● Found one suspicious place
● Wrote demo code to trigger the bug, and a fix
● ...
● PROFIT!

16
(2) How resource controls prevented a DoS attack

uid / resource   held      maxheld    barrier    limit      failcnt
numothersock     9         360        360        360        1

uid / resource   held      maxheld    barrier    limit      failcnt
kmemsize         1237973   14372344   14372700   14790164   80
numothersock     9         360        360        360        1

A simple kernel attack using socketpair(), a.k.a. CVE-2010-4249

18
(3) How a guy measured netns performance
● It was a nice sunny day...
● 5 different configurations to test
● Unpredictable, random results
● CPU throttling caused by overheating; adding a case fan helped!

20
Conclusion
● Containers are good for kernel testing
● Resource limits (cgroups) are also helpful
● [Most] performance tests are a hoax

21
Andrew Vagin <avagin@parallels.com>
Thank you.
Questions?

Editor's notes

My name is Andrey Vagin. I have been working on OpenVZ for the last 5 years. I started as a QA engineer, developing and running Linux kernel tests. Then I moved to the Linux kernel team as a developer. This talk tries to summarize my experience and that of my colleagues at Parallels.
I want to tell you how we test the OpenVZ Linux kernel. I will start by explaining what OpenVZ really is. Next, I will share some thoughts about an ideal test lab. Then we'll see which testing techniques are good for kernel testing, and in particular why OpenVZ helps us find more bugs. I'd also like to say a few words about performance testing. Finally, I will present a few anecdotal cases of bugs we found.
We regularly find and fix bugs in different subsystems of the Linux kernel. Often these bugs are obscure, long-standing and hard to catch. Sometimes maintainers wonder how we find those bugs. Right now I want to reveal all of our deep secrets.
But before I start, I want to say a few words about Linux Containers and OpenVZ Containers. A container is an isolated environment. Each container has its own user, network, filesystem and other namespaces that virtualize various kernel subsystems. Plus, there are cgroups for additional resource accounting. All containers run on top of one single kernel; this is what makes them different from virtual machines. Containers do have some restrictions (for example, on a Linux machine we can only have Linux containers), but the technology is more efficient, because it doesn't have to emulate hardware devices or run multiple kernels. Compared to LXC, OpenVZ Containers have better resource accounting and some extra features such as CPU limits, checkpointing and live migration, NFS and FUSE inside containers and so on.
Based on our experience, these are the requirements for a good test lab. First, the test system should be fully automated. It should include a deployment service, a results portal, many different server configurations and additional hardware such as KVM, power switches and so on. All these components should be tightly integrated and work smoothly together. They may be controlled via a web interface. The test system should offer an easy way to execute tests and to find or compare results.
A lot of people test the Linux kernel, but for us containers play a special role in the process. A container initializes many kernel subsystems on start and destroys them on stop. On a regular system such operations happen only at boot and shutdown, so it is hard to repeat them many times, and once all the deinitialization is done the system usually shuts down anyway. Containers give us a way to run many init/deinit sequences concurrently, which helps to find bugs such as a resource that is never freed. Plus, we have per-container resource accounting, which helps in detecting memory leaks. It also lets us test various rarely executed error paths by setting different limits on resources.
Now I want to talk about one of our most significant tests, called the start/stop test. It starts/stops and suspends/resumes many containers simultaneously and sets random resource limits, just for some more fun. Can you imagine that this test finds many bugs? You are probably not sure, but it does, and it finds bugs not only in the OpenVZ kernel but in the mainstream kernel, too. It is actually also a stress test, since it generates a heavy load. In addition, it runs the initialization and finalization of kernel subsystems many times. The randomized resource limits also force the kernel to execute error paths. On each iteration the test does some sanity checks; for example, it checks that all resource usage counters are zero after a container is stopped. It catches leaks, race conditions, errors in subsystem finalization and even leaks on error paths caused by race conditions.
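The shape of such a test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OpenVZ test itself: `start`, `stop` and `usage` are hypothetical hooks that a real harness would implement on top of its container tooling, and the single random `limit` stands in for a whole set of randomized resource parameters.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def one_cycle(ctid, start, stop, usage, rng):
    """Run one start/stop cycle and return the counters that did not drop to zero."""
    limit = rng.randint(1, 1000)  # random resource limit, arbitrary units
    try:
        start(ctid, limit)  # hypothetical hook; may fail when the limit is too low
    except RuntimeError:
        pass  # an error path was exercised; the leak check below still applies
    finally:
        stop(ctid)  # hypothetical hook; must be safe to call after a failed start
    # Sanity check: every usage counter must be zero once the container is stopped.
    return {name: held for name, held in usage(ctid).items() if held != 0}


def start_stop_test(ctids, start, stop, usage, seed=0, workers=8):
    """Cycle all containers in parallel; return {ctid: leaked counters}."""
    master = random.Random(seed)
    # One private generator per container keeps a run reproducible from its seed.
    rngs = {ctid: random.Random(master.randrange(2**32)) for ctid in ctids}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            ctid: pool.submit(one_cycle, ctid, start, stop, usage, rngs[ctid])
            for ctid in ctids
        }
        results = {ctid: fut.result() for ctid, fut in futures.items()}
    return {ctid: leaks for ctid, leaks in results.items() if leaks}
```

An empty result means no counter stayed above zero. Passing the seed explicitly is a design choice of this sketch: a failing run can then be repeated with the same random limits.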
Performance testing is the most difficult part of testing. The results of these tests are published, and users look at the numbers when choosing a product, so test results should be comprehensible and reproducible. The main problem in creating a performance test is coming up with a useful workload. All performance tests may be divided into atomic tests and complex tests. Atomic tests perform simple basic operations such as context switching, creating a file or forking a process. Users, however, want to see the full picture, so they are more interested in complex tests. A complex test simulates some real workload. What makes a good performance test? Ideally the test should be fully automated, to avoid human factors and ensure consistency. A person may forget to do something or may do it differently next time. If you can't automate the test, you should at least describe the process in great detail. You should avoid side effects such as cron jobs, other daemons doing some work from time to time, database index rebuilds, CPU frequency scaling and other such stuff. You can't be too careful here. We have a special script which validates the test environment, and we update it whenever we find a new source of interference. The test should run several iterations and calculate statistical errors, to make sure the results are reproducible. Often the system needs some time to stabilize, and for this purpose you can execute a few warm-up iterations, ignoring their results. When performing a comparison test, all products should be configured in the same or a similar way. For example, when comparing the network performance of virtualized systems, we should try to use the same networking setup (say, bridged networking). Finally, all the test results, both good and bad, should be analyzed and explained. Analysis is usually done only for bad results, and good ones are taken for granted. The thing is, in some cases good results mean there's something wrong with the test itself.
If you can't explain your test results, they are totally useless, except maybe for marketing purposes.
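The warm-up and statistical-error advice can be made concrete with a short Python sketch. The function name, the default of two warm-up runs and the choice of the standard error of the mean as the error estimate are assumptions of this example, not details taken from the talk.

```python
import math
import statistics


def summarize(samples, warmup=2):
    """Drop warm-up iterations, then return (mean, standard error, relative error)."""
    measured = samples[warmup:]  # warm-up results are ignored on purpose
    if len(measured) < 2:
        raise ValueError("need at least two measured iterations")
    mean = statistics.mean(measured)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    stderr = statistics.stdev(measured) / math.sqrt(len(measured))
    return mean, stderr, stderr / mean


# Requests per second from six iterations; the first two are warm-up runs.
mean, stderr, rel = summarize([50, 80, 100, 102, 98, 100])
print(f"{mean:.1f} +/- {stderr:.2f} req/s ({rel:.1%})")
```

A large relative error is a hint that something in the environment is interfering and that the run should be investigated rather than published.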
Now let me show some results of our performance measurements. We compared Xen, ESXi, KVM and OpenVZ. I chose a LAMP test, because most of our customers are hosting providers. From the following results you can understand how well this type of workload runs in a virtualized environment and how many web servers you can run on a single piece of hardware.
On this slide you can see how the number of virtual machines affects performance, measured in the number of serviced requests per second. With 20 VMs all the products have very similar performance. With 40 VMs the performance difference becomes more obvious. With 60 VMs all products except OpenVZ perform worse than with 40 VMs. This is because the system is too small to handle that many VMs. With OpenVZ, containers are more lightweight, so you can have a greater number of containers than you could have VMs. In other words, OpenVZ density is higher.
Indeed, OpenVZ's high container density is an important feature, so we regularly compare it to other products and try to improve it. For that, we have a special density test. This test simulates a typical web hosting workload. Each container has a web server, a mail server (with SpamAssassin and an antivirus) and Parallels Plesk Panel. The test simulates a workload by sending requests to each service at a defined frequency. On each iteration of the test we add some more containers and measure service response time, making sure it is below a certain limit. The test stops when the response time becomes bad. The test result is the number of containers for which the response time is still good. As with every other test, if we see a regression, we try to understand why it happened, and from time to time we find interesting things. For example, last time we found out that the directory entry cache shrinker was doing its work too aggressively, slowing down the whole system.
One more good test is the week load test. It is one of the few tests which create a non-synthetic workload: it replays real users' Apache logs. We have many tests of our own for OpenVZ-specific features and use third-party test suites for other functionality.
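The iteration scheme of the density test can be sketched as a simple loop. In this hedged Python example, `measure_response_ms` is a hypothetical callback that brings the system to `n` running containers and returns the observed service response time in milliseconds; the step size and the upper bound are arbitrary choices for illustration.

```python
def find_density(measure_response_ms, limit_ms, step=10, max_cts=1000):
    """Return the largest tested container count whose response time is within the limit."""
    good = 0
    n = step
    while n <= max_cts:
        if measure_response_ms(n) > limit_ms:
            break  # response time is bad: stop the test
        good = n  # response time is still good with n containers
        n += step  # add some more containers for the next iteration
    return good
```

With a toy model such as `lambda n: 100 + 10 * n` and a limit of 1000 ms, the loop stops at 100 containers and reports 90 as the density.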
Now I want to tell a real life story of how one of my colleagues fixed a bug in the Linux kernel, prompting Andrew Morton's comment about a Russian bugfinding machine. In the course of OpenVZ kernel testing, our QA (Quality Assurance) team found a leak of 78 bytes of kernel memory. Who cares about 78 bytes, especially on a server with 16 gigabytes of RAM? We do. We checked the beancounters debug information, which showed that one struct user object had leaked. The developer then tried to reproduce the leak, but with no luck. Bugs that cannot be reproduced are hard. The only option left was to audit the kernel source code. That involved finding all the places where a struct user object is referenced and checking the code for correctness. The audit took him 4 hours, and he found one place where the reference to an object might be lost. The bug was present not only in the OpenVZ kernel but in the mainstream kernel, too. In this case, once the problem was found, fixing it was pretty simple. So he wrote a fix and demo code to trigger the bug, tested the fix and sent it to the Linux kernel mailing list. Why is this particular incident so important? It was the OpenVZ resource limiting code that helped to detect the leak in the first place, as the bug is very hard to trigger and the leak is small enough that it might never have been discovered at all. This bug is in fact a security issue. An ordinary user could exploit the bug and eat all the kernel memory, thus bringing the whole system down. Worse scenarios could be possible as well. Incidentally, OpenVZ is protected from this security issue, because the kmemsize beancounter (which helped to find it) limits kernel memory usage per container.
About a year ago, a DoS exploit which leads to system unresponsiveness was published. It looks like most kernels are indeed vulnerable. The good news is that OpenVZ is not vulnerable. Why? Because of user beancounters. The exploit works by creating an unlimited number of sockets, which renders the whole system unusable, so you need to power-cycle it to bring it back to life. Now, if you run this exploit in an OpenVZ container, you will hit the numothersock beancounter limit pretty soon and the script will exit. I went further, set the numothersock limit to 'unlimited' and re-ran the exploit. The situation is much worse in that case: the system slows down considerably, but I was still able to log in to the physical server using ssh and kill the offending task from the host system using SIGTERM. Now another beancounter, kmemsize, is working to save the system. Of course, if you set all beancounters to unlimited, the exploit will work. So don't do that, unless your CT is completely trusted. Those limits are there for a reason, you know.
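Spotting which limits were hit comes down to reading the failcnt column of the beancounters. The Python sketch below parses rows in the six-column layout shown on the slide (resource, held, maxheld, barrier, limit, failcnt); the tolerance for a leading `uid:` token and the assumption of a single container per input are simplifications of this example.

```python
FIELDS = ("held", "maxheld", "barrier", "limit", "failcnt")


def parse_beancounters(text):
    """Parse beancounter rows into {resource: {field: value}}."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].endswith(":"):
            fields = fields[1:]  # drop a leading container id such as "101:"
        if len(fields) != 6 or not all(f.isdigit() for f in fields[1:]):
            continue  # header, version line or blank line
        rows[fields[0]] = dict(zip(FIELDS, map(int, fields[1:])))
    return rows


def hit_limits(rows):
    """Return the resources whose failcnt shows that a limit was hit."""
    return sorted(name for name, row in rows.items() if row["failcnt"] > 0)


# Values copied from the slide for the run with the exploit inside a container.
SAMPLE = """\
uid / resource  held     maxheld   barrier   limit     failcnt
kmemsize        1237973  14372344  14372700  14790164  80
numothersock    9        360       360       360       1
"""
print(hit_limits(parse_beancounters(SAMPLE)))  # ['kmemsize', 'numothersock']
```

A non-zero failcnt means the kernel refused at least one allocation for that resource, which is exactly what stopped the exploit here.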
One of the OpenVZ team members, Kirill Kolyshkin, decided to suspend a container but forgot to specify one parameter. Vzctl returned an error saying that the parameter wasn't specified. When Kir then ran vzctl with the correct parameters, it returned the error “No such container”. After a short investigation, he found that the config file had disappeared. Kir couldn't figure out the problem right away, but then he understood how it could be reproduced and where the problem was in the code. Now look at this code. It allocates one variable on the stack, then validates a parameter and initializes the variable. Nothing looks strange so far, but let's see what happens if the parameter is invalid. Oh, no. The code in the error path uses the uninitialized variable: it removes the file whose name is stored in this variable. By some chance, the variable contained the path to the container's config. Bad luck. GCC doesn't report any warning in this case.
One hot summer day, my colleague was measuring the performance of network namespaces. He got results that looked like a set of random data. These were not his first measurements, and the procedure was well tested. So where was the problem? The day was hot, his brain was not working well, and probably not only his brain. It took him more than an hour to notice a message about CPU throttling due to overheating. The host had no case fan; once one was installed, the results stabilized. What is the conclusion of this story? Make sure that the results are reproducible, and remember side effects.