After getting frustrated with Jenkins getting slow or even stuck occasionally, I decided to investigate the root cause of that bad performance. In this talk I will show my findings (hint: GC), how Jenkins performance can be easily improved by tuning the JVM GC, and a few exciting tools I discovered along the way.
Originally presented at Jenkins TLV meetup @ Taboola
Why does my Jenkins freeze sometimes and what can I do about it?
1. Why does my Jenkins freeze sometimes and what can I do about it?
4. Taboola - Numbers
On Average Every American Sees Taboola 70 Times a Month (comScore)
500K+ Requests/sec
15B recommendations/day
1B Monthly unique users
17TB+ incoming raw data/day
8 Data-Centers globally, with over ~2500 Production servers
5. Taboola - Jenkins
4 Jenkins servers
48 slaves
Hundreds of builds / day
~ 5 releases per day
29. Resources
● Joining the Big Leagues: Tuning Jenkins GC For Responsiveness and Stability
● JVM memory model
● Getting Started with the G1 Garbage Collector
● Everything I Ever Learned about JVM Performance Tuning @twitter
● Here’s How Garbage Collection in JAVA Really Work
● Java HotSpot VM Options
● Why 35GB Heap is Less Than 32GB – Java JVM Memory Oddities
Editor's notes
In order to develop and deliver fast, you need fast tools.
Taboola - The meetup host
Examples for Taboola’s widgets
Taboola’s usage scale
Our Jenkins environment. We have 4 Jenkins masters, one of which is the main Jenkins that serves the major builds. We have about 48 slaves and hundreds of builds per day with lots of automation: unit testing, integration, Selenium and more. We use Jenkins Pipeline widely.
Technologies used at Taboola
I prepared a list of topics that help in solving issues like the freeze issue.
So, Jenkins freezes sometimes. When I first encountered that issue I started investigating the cause. I saw that the Jenkins machine was not loaded (CPU, memory), so I understood that the problem was in the application itself.
Then I found a blog post suggesting that the issue is the Java garbage collector, and I followed that suggestion. In order to understand the garbage collector, let's take a quick look at the JVM and some of its components.
The JVM consists of several parts, for example the class loader, the runtime data areas and the execution engine. The highlighted components - the JIT compiler, the garbage collector and the heap - are the major components that affect JVM performance.
The JIT compiler is out of the scope of this talk, but we will talk about the garbage collector, and for that we need to talk about the heap first.
Let’s focus on the runtime data areas. We see that some parts are shared between threads, like the heap, and some parts are per thread, like the stack.
Most of us know or have heard about the heap and the stack, but what are they actually?
The heap and the stack are memory areas, and each one stores different elements.
The stack, for example, holds local variables: primitives and addresses (references).
The heap stores objects.
In the example we can see that:
We define int x=99 and it is stored in the stack
We define Counter c1 and it is also stored in the stack (it is a reference that does not point to any object yet)
When we create an instance of Counter, the instance is stored in the heap and the address is stored in the stack.
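The example from the slide can be sketched roughly as follows. The body of the Counter class is an assumption (the notes only name the class), and the comments describe the conceptual memory model from the talk rather than what a JIT-optimized JVM is guaranteed to do.

```java
// Sketch of the slide's example; the Counter class body is assumed.
class StackHeapDemo {

    /** A tiny object type; instances of it live on the heap. */
    static class Counter {
        private int value;

        int increment() {
            return ++value;
        }
    }

    static int run() {
        int x = 99;          // primitive local: stored in this thread's stack frame
        Counter c1 = null;   // reference local: also on the stack, points to no object yet
        c1 = new Counter();  // the Counter instance is allocated on the heap;
                             // c1 (on the stack) now holds its address
        c1.increment();      // value becomes 1
        return x + c1.increment();   // 99 + 2
    }

    public static void main(String[] args) {
        System.out.println(run());   // prints 101
    }
}
```

When run() returns, its stack frame (x and c1) disappears, and the Counter instance on the heap becomes unreachable, which is exactly the kind of object the garbage collector reclaims.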
I said that the heap has an effect on the JVM performance, and the thing that most of us do (in relation to the heap) is to define its maximum size.
Pop quiz:
We have 2 heaps of different sizes. In which heap do you think we can store more objects?
The correct answer is the 32GB heap. The reason is a special mechanism called compressed oops (the UseCompressedOops JVM option).
This mechanism shrinks object pointers from 64 bits to 32 bits (on 64-bit platforms), which leaves a lot more room in the heap for objects.
The mechanism is enabled by default on heaps up to 32GB (actually a little less) and is disabled above that. When it is disabled, pointers grow back to 64 bits and much less of the heap is left for objects.
To compensate for that, we need to create a 48GB heap!
So, when you define your heap size, try to keep it below 32GB or go above 48GB.
To define the maximum heap size we use -Xmx
-Xms is used to define the initial heap size.
In order to check whether UseCompressedOops is enabled, we can use jinfo, a tool that comes with the JDK.
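Besides jinfo, the same check can be done from inside the JVM. The following is a minimal sketch, assuming a HotSpot-based JVM that exposes com.sun.management.HotSpotDiagnosticMXBean; the class and method names (HeapSettings, vmOption, maxHeapMiB) are made up for the illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reports the effective heap limit and whether
// compressed oops are active in the JVM that runs this code.
class HeapSettings {

    /** Returns the current value of a HotSpot VM option, or "unavailable". */
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            return "unavailable";   // not a HotSpot JVM
        }
        try {
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            return "unavailable";   // option does not exist in this JVM (e.g. 32-bit)
        }
    }

    /** Maximum heap the JVM will try to use, in MiB (roughly the -Xmx value). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MiB):    " + maxHeapMiB());
        System.out.println("UseCompressedOops: " + vmOption("UseCompressedOops"));
    }
}
```

Running it once with -Xmx31g and once with -Xmx33g should show the option flipping from true to false, provided the machine can start a JVM with those settings.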
Let’s have a look at the heap areas.
We have Eden, the area where most objects are created and removed (collected).
The survivor areas - objects that survive a collection in Eden are moved to a survivor area. An object that survives a collection in a survivor area is moved to the other survivor area, for the first N cycles. After the Nth cycle the object gets promoted to the old generation.
Eden + survivor = Young
Old = Tenured
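To see these areas on a live JVM, the standard java.lang.management API lists the memory pools. This is only a sketch: the pool names depend on the collector in use (for example "PS Eden Space" with Parallel GC or "G1 Eden Space" with G1), and the class name HeapPools is made up here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the heap pools (Eden, Survivor, Old) of this JVM.
class HeapPools {

    /** Names of all memory pools that belong to the heap. */
    static List<String> heapPoolNames() {
        List<String> names = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                names.add(pool.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;   // skip metaspace, code cache and other non-heap pools
            }
            MemoryUsage usage = pool.getUsage();   // may be null if the pool is invalid
            long usedKiB = usage == null ? 0 : usage.getUsed() / 1024;
            System.out.println(pool.getName() + ": " + usedKiB + " KiB used");
        }
    }
}
```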
Now let’s talk about the garbage collector
What is the garbage collector?
The garbage collector is an automatic memory management process.
One of its main goals is to identify unused objects in the heap and release the memory that they hold.
(Note: unused objects are objects that can no longer be referenced)
There are a few different implementations of the garbage collector.
You can check which garbage collector your application uses with the command:
jmap -heap <pid>
In the example I started a Jenkins instance on my laptop and got:
Parallel GC with 4 threads
This is the GC type.
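The collector can also be identified from inside the process, which helps on JDK 9 and later, where `jmap -heap` moved to `jhsdb jmap --heap --pid <pid>`. A minimal sketch using the standard GarbageCollectorMXBean follows; the class name WhichGc is made up, and the bean names in the comment are the ones HotSpot commonly reports.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: prints the garbage collectors active in this JVM.
class WhichGc {

    /**
     * Collector names as reported by the JVM, for example
     * "PS Scavenge" / "PS MarkSweep" for Parallel GC or
     * "G1 Young Generation" / "G1 Old Generation" for G1.
     */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": " + gc.getCollectionCount() + " collections, "
                    + gc.getCollectionTime() + " ms total");
        }
    }
}
```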
There are 4 types of GC: Serial, Parallel (which we saw earlier), CMS (Concurrent Mark Sweep) and G1.
The first 2, Serial and Parallel, stop the application while they run. This pause is called a “stop the world” pause.
The other 2, CMS and G1, try to keep the pause time to a minimum by doing some of their work concurrently with the application. They are also called mostly concurrent collectors.
G1 is the newest and is going to be the default GC in Java 9.
The solution I present here uses G1.
G1 divides the heap a bit differently: it breaks the heap into lots of small regions, each 1MB to 32MB in size. Those regions are logically grouped into the same heap areas we saw earlier (Eden, Survivor and Old gen).
One of the first things G1 does is pick the regions with the most garbage - the regions that hold the most unused / unreferenced objects - and collect them.
Its name comes from that action: G1 = garbage first.
G1 has a default pause time target of 200ms, and it tries to meet it by learning from previous collections and selecting only regions that it can collect in time.
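As a rough illustration of how the region size relates to the heap size: Oracle's G1 documentation describes a goal of around 2048 regions, with a region size that is a power of two between 1MB and 32MB. The sketch below is an approximation under that assumption, not the exact HotSpot logic, and the names G1RegionSize and approxRegionSize are made up.

```java
// Approximation only: assumes a target of about 2048 regions and a
// power-of-two region size clamped to the 1MB..32MB range.
class G1RegionSize {

    static final long MB = 1024L * 1024L;

    /** Estimated G1 region size in bytes for a heap of the given size. */
    static long approxRegionSize(long heapBytes) {
        long target = Math.max(heapBytes / 2048, MB);   // at least 1MB
        long powerOfTwo = Long.highestOneBit(target);   // round down to a power of two
        return Math.min(powerOfTwo, 32 * MB);           // at most 32MB
    }

    public static void main(String[] args) {
        long[] heapsGb = {1, 4, 16, 100};
        for (long gb : heapsGb) {
            long region = approxRegionSize(gb * 1024 * MB);
            System.out.println(gb + "GB heap -> ~" + (region / MB) + "MB regions");
        }
    }
}
```

The actual value chosen by a running JVM can be pinned with -XX:G1HeapRegionSize, so treat the estimate as a way to build intuition rather than a prediction.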
Now that we have the background, let’s see the general guidelines to solve the freeze issue.
First enable G1GC by using the JVM flag -XX:+UseG1GC
In addition, add GC log flags so we can analyze the GC behavior from the log.
In the example you can see some of the log flags I use.
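The log flags from the slide are not part of these notes, so the list below is an assumption: it uses logging flags that are common on Java 8 (-Xloggc:<file>, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps), which may differ from the ones shown in the talk. On Java 9 and later those were replaced by the unified -Xlog:gc* option. The sketch checks whether a running JVM was actually started with the wanted flags; the class name GcFlagCheck is made up.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sanity check: were G1 and GC logging enabled at startup?
class GcFlagCheck {

    // -XX:+UseG1GC comes from the talk; the logging flags are assumed (Java 8 style).
    static final List<String> WANTED = Arrays.asList(
            "-XX:+UseG1GC",
            "-Xloggc:",                 // prefix match: the log file path follows the colon
            "-XX:+PrintGCDetails",
            "-XX:+PrintGCDateStamps");

    /** Returns the wanted flags that do not appear in the given JVM arguments. */
    static List<String> missing(List<String> jvmArgs) {
        List<String> missing = new ArrayList<>();
        for (String wanted : WANTED) {
            boolean found = false;
            for (String arg : jvmArgs) {
                if (arg.startsWith(wanted)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                missing.add(wanted);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Missing flags: " + missing(jvmArgs));
    }
}
```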
A record in the log will look like this.
There is a lot of data here: type of action, time, duration, memory before and after, and a lot more.
In order to analyze it we need an automatic tool.
I use GCeasy.
This tool allows you to upload GC logs and it performs analysis on them.
It will tell you if there is a problem and suggest a solution.
In addition it will display a report with some data.
In the example report you can see that the throughput is about 60%. Throughput is the percentage of time in which the application runs, which means that the application was stopped for about 40% of the time. This is a very low throughput.
You can also see that the maximum garbage collection time is almost 10 seconds, whereas we talked about a target of 200ms.
So we can definitely understand that something is not working well.
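The throughput figure is simple arithmetic: the share of elapsed time that was not spent in GC pauses. The sketch below shows the formula and a rough live estimate based on the JVM's own counters. It is not how GCeasy computes its report; the collection time reported by the MXBeans is approximate and may include concurrent work, so the result is only an indication. The class name GcThroughput is made up.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: throughput = time the application ran / total elapsed time.
class GcThroughput {

    /** Throughput in percent, given the paused time and the total elapsed time. */
    static double throughputPercent(long pausedMillis, long elapsedMillis) {
        if (elapsedMillis <= 0 || pausedMillis < 0 || pausedMillis > elapsedMillis) {
            throw new IllegalArgumentException("need 0 <= paused <= elapsed and elapsed > 0");
        }
        return 100.0 * (elapsedMillis - pausedMillis) / elapsedMillis;
    }

    public static void main(String[] args) {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime();   // -1 when the collector does not report it
            if (time > 0) {
                gcMillis += time;
            }
        }
        long uptime = Math.max(ManagementFactory.getRuntimeMXBean().getUptime(), 1);
        gcMillis = Math.min(gcMillis, uptime);    // keep the estimate within bounds
        System.out.printf("Approximate throughput: %.2f%%%n",
                throughputPercent(gcMillis, uptime));
    }
}
```

For the report in the talk, 40 seconds of pauses in every 100 seconds of run time gives throughputPercent(40_000, 100_000) = 60.0.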
In this report you can see the duration of the GC over time. At first there were only young collections, which performed well, but then it started to perform full collections (on the entire heap), which take a lot of time.
Such cases indicate that tuning is required.
Tuning is done by adding (or removing) JVM flags. Here are some JVM flags that are related to G1GC (the numbers are default values; the thread counts are derived from the number of CPUs, so they differ between machines):
-XX:MaxGCPauseMillis=<n> (default 200ms) - changes the pause time target
-XX:InitiatingHeapOccupancyPercent=45 - the heap occupancy percentage at which the GC starts its concurrent marking cycle
-XX:ParallelGCThreads=15 - number of threads used in “stop the world” GC phases
-XX:ConcGCThreads=4 - number of threads used in concurrent GC phases
-XX:+UseStringDeduplication - string deduplication is a G1 feature that saves the memory taken by duplicate strings in the heap
-XX:+DisableExplicitGC - makes the JVM ignore System.gc() calls, which can otherwise trigger unnecessary GC cycles
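Since several of these defaults depend on the machine, it can help to read the effective values from the running JVM before changing anything. This is a minimal sketch, assuming a HotSpot JVM; the class name G1FlagReport is made up. The origin field tells whether a value is a built-in default, was chosen by the JVM's ergonomics, or was set at startup.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints the effective values of the flags listed above.
class G1FlagReport {

    static final String[] FLAGS = {
        "MaxGCPauseMillis",
        "InitiatingHeapOccupancyPercent",
        "ParallelGCThreads",
        "ConcGCThreads",
        "UseStringDeduplication",
        "DisableExplicitGC"
    };

    /** One line describing the flag's current value and where it came from. */
    static String describe(String flag) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            return flag + ": unavailable (not a HotSpot JVM)";
        }
        try {
            VMOption option = bean.getVMOption(flag);
            // Origin is e.g. DEFAULT, ERGONOMIC or VM_CREATION (set on the command line)
            return flag + "=" + option.getValue() + " (origin: " + option.getOrigin() + ")";
        } catch (IllegalArgumentException e) {
            return flag + ": unavailable in this JVM";
        }
    }

    public static void main(String[] args) {
        for (String flag : FLAGS) {
            System.out.println(describe(flag));
        }
    }
}
```

Comparing the output before and after a change is a quick way to confirm that a flag added to the Jenkins startup script actually reached the JVM.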
Adding flags does not end the process; it will take a few cycles until you find the proper flags for your application.
Note: after switching to G1 and adding flags, you need to give your application some time to run in order to get enough data in the logs.
Monitoring
There are a lot of monitoring tools, but a nice tool that comes with Java is JConsole.
JConsole can connect to the process (locally or remotely) and show some data about the JVM. The data includes general JVM data (flags, heap size, etc.) and over-time graphs of CPU usage, class count, thread count and memory usage.
You can identify issues with the heap graph, for example if the graph shows that memory usage gets higher and higher over time, or if the GC cycles don’t free enough memory.
In order to connect with JConsole, you need to add the JMX system properties to the startup of your application.
The properties are: host, port and authentication.
Then you can open JConsole, type the host and port and connect.
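The same JMX endpoint that JConsole uses can also be queried from code, for example to feed heap usage into another monitoring system. The sketch below assumes the standard properties were used on the Jenkins side (java.rmi.server.hostname, com.sun.management.jmxremote.port and com.sun.management.jmxremote.authenticate) and that authentication is disabled; with authentication enabled, credentials would have to be passed to the connector. The class name RemoteHeapProbe and the example host and port are made up.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.net.MalformedURLException;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical probe: reads heap usage of a remote JVM over JMX (no authentication).
class RemoteHeapProbe {

    /** Builds the standard RMI-based JMX URL for the given host and port. */
    static JMXServiceURL serviceUrl(String host, int port) {
        try {
            return new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad host or port", e);
        }
    }

    /** Connects, reads the heap usage once, and closes the connection. */
    static MemoryUsage remoteHeapUsage(String host, int port) throws IOException {
        try (JMXConnector connector = JMXConnectorFactory.connect(serviceUrl(host, port))) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            return memory.getHeapMemoryUsage();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: RemoteHeapProbe <host> <port>");
            return;
        }
        MemoryUsage heap = remoteHeapUsage(args[0], Integer.parseInt(args[1]));
        System.out.println("Heap used: " + heap.getUsed() / (1024 * 1024) + " MiB of "
                + heap.getMax() / (1024 * 1024) + " MiB");
    }
}
```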
The last item I want to mention is the Jenkins old data mechanism (Jenkins -> Manage Jenkins -> Manage Old Data).
This is a backward compatibility mechanism. When the format of the files on disk changes, Jenkins keeps the old format and loads the new format into memory.
This can also cause redundant GC cycles or even out-of-memory errors.
So make a habit (or a script) of checking from time to time whether the data on that page is still relevant, and if it is not, remove it.
To sum it up, in order to solve the freeze issue:
Enable G1GC and logs
Analyze the current behavior
Add relevant flags (you’ll need to understand the flags you add)
Repeat the analyze and add-flags steps until satisfied
Keep monitoring
Remove old data from time to time (especially after updating plugins or core)