Tuesday, March 19, 2019

Pause Every Thing and Stop The World (User, Sys and Real Time in GC)


These days, the biggest impact on Java’s performance comes from its garbage collection. Fortunately, in many cases, it can be tweaked and optimized to improve performance.

Different events can cause the JVM to pause all the application threads. Such pauses are called Stop-The-World (STW) pauses. The most common cause for an STW pause to be triggered is garbage collection.

The garbage collector thread is a daemon thread. A daemon thread is a low-priority thread that runs intermittently in the background, performing garbage collection or other housekeeping work for the Java runtime system.
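As a small illustrative sketch (not from the original post), the following Java snippet prints every live Java thread and whether it is a daemon. The GC worker threads themselves are native VM threads and will not appear in this list, but JVM housekeeping daemons such as the Finalizer and Reference Handler will:

public class ThreadModes {
    public static void main(String[] args) {
        // Print every live Java thread and whether it is a daemon thread.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            System.out.println(t.getName() + " -> daemon=" + t.isDaemon());
        }
    }
}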


Stop The World (STW) Pause


Stop the world will occur no matter which GC algorithm you choose.

Stop-the-world means that the JVM stops the application from running in order to execute a GC. When a stop-the-world pause occurs, every mutator thread (that is, all application / non-daemon threads) is suspended simultaneously for some period of time; only the threads needed for the GC keep running. The interrupted threads resume their tasks only after the GC work has completed. This is typically done to collate the global roots, including the stacks of all threads, and in production GCs this root-collation step is a sub-millisecond pause.


GC tuning often means reducing this stop-the-world time.

In our production environments, we have repeatedly seen that applications running in the JVM (Java Virtual Machine) occasionally experience large STW (Stop-The-World) application pauses because the JVM's GC logging is blocked by background IO traffic (e.g., OS page cache writeback). During such STW pauses, the JVM pauses all application threads, and the application stops responding to user requests, incurring unacceptable delays for latency-sensitive use cases.




 

 Pauses are induced by the JVM GC

  • JVM GC needs to log GC activities by issuing write() system calls;
  • Such write() calls can be blocked by background disk IO; even in an asynchronous write mode (i.e., buffered IO or non-blocking IO), they can still be blocked for considerable time by OS mechanisms, including page cache writeback.
  • GC logging is on the JVM pausing path, hence the time taken by these write() calls contributes to JVM STW pauses.
Example : GC logging process (User/Kernel Modes)

The GC logging process is on the JVM's STW pausing path, and the time taken for logging is part of the STW pause. Specifically, the entire application pause mainly consists of two parts: the pause due to JVM GC activity, and the pause due to the OS blocking the write() system call issued for GC logging. The following diagram shows the relationship between them.



For latency-sensitive Java applications, an immediate solution is to avoid the IO contention by putting the GC log file on a separate HDD or on a high-performing device such as an SSD or tmpfs.
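A hedged sketch of the tmpfs approach (the mount point, size, and log path below are illustrative, not taken from the original setup); note that logs on tmpfs live in RAM and do not survive a reboot:

mount -t tmpfs -o size=256m tmpfs /gclogs
-Xloggc:/gclogs/gc.log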

While GC-induced STW pauses that scan/mark/compact heap objects are well known, we found that there are also large STW pauses caused by background IO traffic. In production environments, we have seen such unexplained large STW pauses in our mission-critical Java applications.

In general, GC does not strictly require a stop-the-world pause; there are JVM implementations that are (almost) pause-free (e.g., the Azul Zing JVM). Whether and when the JVM requires an STW pause to collect garbage depends on the algorithm it is using.



In garbage collection, there are two types of events:
  • An event that fully pauses the application. These events are also called "stop-the-world events." No customer transactions are processed during such an event. Contrary to common belief, "Young GC" falls under this category: when a Young GC runs, the entire application is paused until it completes.
  • An event that only partially pauses the application.
For example, if you are using the G1 GC algorithm, an old generation GC cycle has 5 phases. Of those 5 phases, only 3 pause the application threads completely; the other 2 run concurrently with the application threads.

G1 GC Phase              Type
Initial Mark             Stop-the-world phase
Root Region Scanning     Concurrent phase
Concurrent Marking       Concurrent phase
Remark                   Stop-the-world phase
Cleanup                  Stop-the-world phase


CMS old generation GC has 7 phases. Of those 7 phases, only 2 pause the application threads completely; the other 5 run concurrently with the application threads.

CMS GC Phase                     Type
Initial Mark                     Stop-the-world phase
Concurrent Mark                  Concurrent phase
Concurrent Preclean              Concurrent phase
Concurrent Abortable Preclean    Concurrent phase
Final Remark                     Stop-the-world phase
Concurrent Sweep                 Concurrent phase
Concurrent Reset                 Concurrent phase

Example: CMS GC enabled, with GC logging captured for the GC events (Minor/Major/Full GC)

-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:/tmp/log/gc-$(date +%Y_%m_%d-%H_%M).log
-XX:+AlwaysPreTouch
-XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses
-XX:+UseConcMarkSweepGC
-XX:+AggressiveOpts
-XX:+ParallelRefProcEnabled
-XX:+TieredCompilation
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/log/heap_out_of_memory.hprof


Full GC (note: this particular sample comes from the throughput (Parallel) collector rather than CMS, as the PSYoungGen/PSOldGen entries indicate)
2017-09-21T16:39:37.728-0800: 88.808: [Full GC [PSYoungGen: 116544K->12164K(233024K)] [PSOldGen: 684832K->699071K(699072K)] 801376K->711236K(932096K) [PSPermGen: 2379K->2379K(21248K)], 3.4230220 secs] [Times: user=3.40 sys=0.02, real=3.42 secs]

Major GC

Phase 1: Initial Mark.
2017-09-21T16:23:07.321-0200: 64.425: [GC (CMS Initial Mark) [1 CMS-initial-mark: 10812086K(11901376K)] 10887844K(12514816K), 0.0001997 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]

Phase 2: Concurrent Mark.
2017-09-21T16:23:07.357-0200: 64.460: [CMS-concurrent-mark: 0.035/0.035 secs] [Times: user=0.07 sys=0.00, real=0.03 secs]

Phase 3: Concurrent Preclean.

2017-09-21T16:23:07.373-0200: 64.476: [CMS-concurrent-preclean: 0.016/0.016 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]

Phase 4: Concurrent Abortable Preclean.
2017-09-21T16:23:08.446-0200: 65.550: [CMS-concurrent-abortable-preclean: 0.167/1.074 secs] [Times: user=0.20 sys=0.00, real=1.07 secs]

Phase 5: Final Mark.
2017-09-21T16:23:08.447-0200: 65.550: [GC (CMS Final Remark) [YG occupancy: 387920 K (613440 K)]65.550: [Rescan (parallel) , 0.0085125 secs]65.559: [weak refs processing, 0.0000243 secs]65.559: [class unloading, 0.0013120 secs]65.560: [scrub symbol table, 0.0008345 secs]65.561: [scrub string table, 0.0001759 secs][1 CMS-remark: 10812086K(11901376K)] 11200006K(12514816K), 0.0110730 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]

Phase 6: Concurrent Sweep.
2017-09-21T16:23:08.485-0200: 65.588: [CMS-concurrent-sweep: 0.027/0.027 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]

Phase 7:  Concurrent Reset.
2017-09-21T16:23:08.497-0200: 65.601: [CMS-concurrent-reset: 0.012/0.012 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]

Minor GC

2017-09-21T11:10:34.831+0530: 0.208: [GC (Allocation Failure) 0.208: [ParNew: 33280K->4160K(37440K), 0.0627725 secs] 33280K->18152K(120768K), 0.0628398 secs] [Times: user=0.18 sys=0.01, real=0.07 secs]

0.0628398 secs – the GC took 0.06 seconds to complete.
[Times: user=0.18 sys=0.01, real=0.07 secs]
where,
real = 0.07 secs is the total elapsed time of the GC event. This is essentially the wall-clock time you would see on a stopwatch.
user = 0.18 secs is the CPU time spent in user-mode code (outside the kernel), summed across all GC threads.
sys = 0.01 secs is the amount of CPU time spent in kernel mode, i.e., executing system calls inside the kernel, as opposed to library code, which still runs in user space.

REAL TIME, USER TIME, SYS TIME

Before even getting into GC time, let's take a couple of minutes to learn about the Unix command "time". When you issue the following UNIX command:

time nc -z www.java2depth.com

You are going to see output like this:


The 'time nc -z www.java2depth.com' command first displays the output of executing "nc -z www.java2depth.com", which shows the connectivity status for this host.


Next, you see the amount of time it took to execute "nc -z www.java2depth.com", which is:
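Representative output of the shell's time builtin (the numbers below are purely illustrative; your values will differ):

real    0m0.069s
user    0m0.003s
sys     0m0.005s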


Note that the "real," "user," and "sys" times are displayed here. This is the same data that we see in GC logs.


  • Real is wall clock time (time from start to finish of the call). This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).
  • User  is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes, and the time the process spends blocked, do not count towards this figure.
  • Sys is the time spent in OS calls or waiting for system events, i.e., the amount of CPU time the process spent in kernel mode. This means CPU time spent executing system calls inside the kernel, as opposed to library code, which still runs in user space. Like user, this is only CPU time used by the process.
Note:
  • User + Sys tells you how much actual CPU time your process used. This is summed across all CPUs, so if the process has multiple threads (and is running on a machine with more than one processor), it can exceed the wall-clock time reported by Real, which is in fact the common case for parallel GC.
  • In the minor GC example above, the CPU time (user + sys = 0.19 sec) is considerably higher than the elapsed real time (0.07 sec), so we can conclude that the GC ran on multiple threads.

C1: Real Time < User Time + Sys Time [Times: user=11.53 sys=1.38, real=1.03 secs]
  • That's because this log line is collected from the JVM, where multiple GC threads are configured on a multi-core/multi-processor server. As multiple threads execute GC in parallel, the workload is shared amongst these threads, so the actual clock time ('real') is much less than the total CPU time ('user' + 'sys').
  • How is it possible that the real elapsed time is lower than user time? Well, it’s because the user code was executed by multiple threads and the statistics print the sum of all threads working on it. So basically, multiple GC threads were involved.

C2: Real Time = User Time + Sys Time [Times: user=0.09 sys=0.00, real=0.09 secs]
  • That’s a typical situation when the Serial Garbage Collector is used. Because it runs only one GC thread, user + sys time will be more or less similar to the real time.

C3: Real Time > User Time + Sys Time [Times: user=0.17 sys=0.00, real=3.18 secs]
  • In certain circumstances, you might see the real time be greater than user + sys time. If this happens often and the differences are substantial, it usually means there is either a lack of CPU or heavy I/O activity on the system. Lack of CPU means the application doesn't get enough CPU cycles to run because they are shared among too many processes.
  • The solution might be to add more CPU cores or to reduce the number of the most CPU-consuming processes. Heavy I/O activity is discussed in the tuning section below ("How to Reduce Long GC Pauses").

How Mahatma Gandhi Helped Me

I still haven't answered the question of how Mahatma Gandhi helped me. The following is one of his immortal quotes on customers:

"A customer is the most important visitor on our premises. He is not dependent on us. We are dependent on him. He is not an interruption in our work. He is the purpose of it. He is not an outsider in our business. He is part of it. We are not doing him a favor by serving him. He is doing us a favor by giving us an opportunity to do so."

According to Gandhi, the Customer is the primary focus. Similarly, when you are looking at performance optimization, you are primarily optimizing response time for your customer. The customer doesn’t care how many GC threads you use or how many processors you have. What matters is how many seconds he has to wait before he sees his screen painted with data. So “real” time is the actual time spent in running the process from start to finish, as if it was measured by a human with a stopwatch.

But that doesn't mean the 'sys' or 'user' times are unimportant. They matter when deciding whether to increase your GC thread count or CPU count in order to reduce your GC pause times.

How to Reduce Long GC Pauses

1. Heavy I/O activity or Background IO Traffic  - optimize your application’s I/O activity

When there is heavy I/O activity (i.e., networking, disk access, user interaction) on the server, the real time tends to spike dramatically. As part of GC logging, your application performs disk writes; if there is heavy I/O activity on the server, the GC event can get stalled behind that I/O, inflating the real time.

Note:
  • Even if your application is not causing heavy I/O activity, other processes on the server may be causing it, leading to high 'real' time.
  • When there is heavy I/O activity, you will notice the ‘real’ time to be significantly higher than ‘user’ time. 
  • Example: [Times: user=0.20 sys=0.01, real=18.45 secs]
When this happens, here are some potential solutions to solve it:
  • If high I/O activity is caused by your application, then optimize it.
  • Eliminate the processes that are causing high I/O activity on the server.
  • Move your application to a different server where there is less I/O activity.
You can monitor I/O activity on your server using sar (System Activity Reporter) on Unix.
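For example (assuming the sysstat package that provides sar is installed; the sampling interval and count are illustrative):

sar -d 1 10     (per-device disk activity, one sample per second, 10 samples)
sar -b 1 10     (overall I/O and transfer-rate statistics)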


 2. High Object Creation Rate
  • If your application’s object creation rate is very high, then to keep up with it, the garbage collection rate will also be very high. A high garbage collection rate will increase the GC pause time as well. 
  • Thus, optimizing the application to create fewer objects is THE EFFECTIVE strategy to reduce long GC pauses. This might be a time-consuming exercise, but it is 100% worth doing. In order to optimize the object creation rate in the application, you can consider using Java profilers like JProfiler, YourKit, or JVisualVM.
These profilers will report:
  • What are the objects that have been created?
  • What is the rate at which these objects are created?
  • What is the amount of space they are occupying in memory?
  • Who is creating them?
Always try to optimize the objects that occupy the most amount of memory. Go after the big fish in the pond.
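As a purely illustrative Java sketch (the class and method names are hypothetical, not from the original post), the pattern below shows the kind of change profilers often point you toward: the first method allocates several temporary objects per call, while the second reuses a buffer and allocates noticeably less on a hot path:

public class ReportFormatter {
    // Reused buffer; assumes this formatter instance is used by a single thread.
    private final StringBuilder buffer = new StringBuilder(256);

    // Allocation-heavy version: string concatenation creates several temporaries per call.
    String formatAllocHeavy(String id, double value) {
        return "id=" + id + ", value=" + value + ";";
    }

    // Allocation-lighter version: resets and reuses the builder, producing only the final String.
    String formatAllocLight(String id, double value) {
        buffer.setLength(0);
        return buffer.append("id=").append(id)
                     .append(", value=").append(value)
                     .append(';').toString();
    }
}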

Upload your GC log to the Universal Garbage Collection log analyzer tool GCeasy. This tool will report the object creation rate. There will be field by name ‘Avg creation rate’ in the section ‘Object Stats.’ This field will report the object creation rate.


3. System.gc() Calls

When the System.gc() or Runtime.getRuntime().gc() method is invoked, it causes a stop-the-world full GC. During a stop-the-world full GC, the entire JVM is frozen (i.e., no user activity is performed during this period). System.gc() calls are made from one of the following sources:
  • Your own developers might be explicitly calling the System.gc() method.
  • It could be third-party libraries, frameworks, or sometimes even application servers that you use. Any of those could be invoking the System.gc() method.
  • It could be triggered from external tools (like VisualVM) through the use of JMX.
  • If your application is using RMI, then RMI invokes System.gc() on a periodic interval. This interval can be configured using the following system properties:
-Dsun.rmi.dgc.server.gcInterval=n 
-Dsun.rmi.dgc.client.gcInterval=n  

Evaluate whether it's absolutely necessary to explicitly invoke System.gc(). If there is no need for it, then please remove it. On the other hand, you can forcefully disable System.gc() calls by passing the JVM argument -XX:+DisableExplicitGC. To find out whether System.gc() is being called at all, see the next section.
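As a hedged illustration (the interval value and application jar below are placeholders, not recommendations), explicit GC can either be disabled outright, or downgraded to a concurrent cycle while the RMI-triggered interval is stretched to once per hour:

java -XX:+DisableExplicitGC -jar app.jar
java -XX:+ExplicitGCInvokesConcurrent -Dsun.rmi.dgc.server.gcInterval=3600000 -Dsun.rmi.dgc.client.gcInterval=3600000 -jar app.jar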
How to Know Whether System.gc() Calls Are Explicitly Called?

Upload your GC log to the Universal Garbage Collection log analyzer tool GCeasy. This tool has a section called 'GC Causes.' If GC activity is triggered by 'System.gc()' calls, it will be reported in this section. See the image (an excerpt from the GCeasy-generated report), showing that System.gc() was called four times during the lifetime of this application.

 4. Undersized Young Generation
  • When the young generation is undersized, objects will be prematurely promoted to the old generation. Collecting garbage from the old generation takes more time than collecting it from the young generation. 
  • Thus, increasing the young generation size has the potential to reduce long GC pauses. The young generation's size can be increased by setting one of these two JVM arguments:
-Xmn: specifies the size of the young generation.
-XX:NewRatio: Specifies the size of the young generation relative to the old generation.

For example, setting -XX:NewRatio=2 means that the ratio between the old and young generation is 2:1, so the young generation will be one third of the overall heap. So, if the heap size is 3 GB, then the young generation size would be 1 GB.
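For instance (the heap sizes and application jar are illustrative), either of the following command lines gives a 3 GB heap with a 1 GB young generation:

java -Xms3g -Xmx3g -Xmn1g -jar app.jar
java -Xms3g -Xmx3g -XX:NewRatio=2 -jar app.jar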


Note :
  • Oracle recommends setting the minimum heap size (-Xms) equal to the maximum heap size (-Xmx) to minimize garbage collections. Heap size is usually between ¼ and ½ of system memory.
Maximum Heap Size Formula: max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8 GB))
Old Generation Formula: 2/3 * (-Xmx)
Young Generation Formula: 1/3 * (-Xmx)
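A worked example of these formulas, assuming a machine with 16 GB of RAM (the figure is illustrative):

Maximum Heap Size = max(min(8 GB, 1024 MB), min(4 GB, 8 GB)) = max(1 GB, 4 GB) = 4 GB  (-Xmx4g)
Old Generation   ≈ 2/3 * 4 GB ≈ 2.7 GB
Young Generation ≈ 1/3 * 4 GB ≈ 1.3 GB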




  • By default, the Application Server is invoked with the Java HotSpot Server JVM. The default NewRatio for the Server JVM is 2: the old generation occupies 2/3 of the heap while the new (young) generation occupies 1/3. A larger young generation can accommodate many more short-lived objects, decreasing the need for slow major collections, while the old generation is still sufficiently large to hold many long-lived objects. With this default 2:1 ratio between the old and young generation, the combined size of eden and the survivor spaces is one third of the heap.

5. Choice of GC Algorithm
  • Your GC algorithm has a major influence on the GC pause time. If you are a GC expert or intend to become one (or someone on your team is a GC expert), you can tune GC settings to obtain optimal GC pause time. If you don’t have a lot of GC expertise, then I would recommend using the G1 GC algorithm because of its auto-tuning capability. 
  • In G1 GC, you can set the GC pause time goal using the system property ‘-XX:MaxGCPauseMillis.’ 
  • Example: -XX:MaxGCPauseMillis=200
As per the above example, the maximum GC pause time goal is set to 200 milliseconds. This is a soft goal, which the JVM will try its best to meet.
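A minimal illustrative command line (the heap size and application jar are placeholders):

java -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xms4g -Xmx4g -jar app.jar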

6. Process Swapping
  • Sometimes due to a lack of memory (RAM), the operating system could be swapping your application from memory. Swapping is very expensive, as it requires disk accesses, which is much slower compared to physical memory access. 
  • In my humble opinion, no serious application in a production environment should be swapping. When the process swaps, GC will take a long time to complete.
If you find your processes are swapping, then do one of the following:
  • Allocate more RAM to the server.
  • Reduce the number of processes running on the server so that memory (RAM) is freed up.
  • Reduce the heap size of your application (which I wouldn’t recommend, as it can cause other side effects. Still, it could potentially solve your problem).
Find the shell script from StackOverflow (thanks to the author) that, when executed, will show all the processes that are being swapped. Please make sure your process is not getting swapped.
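This is not the original StackOverflow script, but a minimal check, assuming a Linux kernel that exposes VmSwap in /proc/<pid>/status; the second command watches ongoing swap activity (non-zero si/so columns indicate active swapping):

grep VmSwap /proc/*/status | grep -v " 0 kB"
vmstat 1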

7. Tuning the Number of GC Threads

For every GC event reported in the GC log, user, sys, and real times are printed. Example:

[Times: user=25.56 sys=0.35, real=20.48 secs]

If, in the GC events, you consistently notice that ‘real’ time isn’t significantly less than the ‘user’ time, then it might be indicating that there aren’t enough GC threads. Consider increasing the GC thread count. Suppose ‘user’ time is 25 seconds and you have configured the GC thread count to be 5, then the real time should be close to 5 seconds (because 25 seconds/5 threads = 5 seconds).
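For example, with the HotSpot collectors the GC thread counts can be set explicitly (the values below are illustrative, not recommendations):

-XX:ParallelGCThreads=8
-XX:ConcGCThreads=2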

WARNING: Adding too many GC threads will consume a lot of CPU and take away resources from your application. Thus you need to conduct thorough testing before increasing the GC thread count.

8. Lack of CPU – Reduce the number of processes that are running on the server
  • If multiple processes are running on your server and your application doesn't get enough CPU cycles to run, it will start to wait. When the application starts to wait, the real time will be considerably higher than the user + sys time.
  • You can use the commands like ‘top’ or profiling tools like ‘jProfiler’ to observe the CPU utilization on the application server.
User/Kernel Mode - (User/Sys)


The user process executes in user mode until it makes a system call. Then a system trap is generated and the mode bit is set to zero, and the system call executes in kernel mode. After the call completes, another trap is generated, the mode bit is set back to 1, control returns to user mode, and the process execution continues. (A quick way to observe these transitions for GC logging is shown after the mode descriptions below.)

User Mode:

  • In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode.
  • The mode bit is set to 1 in the user mode. It is changed from 1 to 0 when switching from user mode to kernel mode.

Kernel Mode: 

  • Typical work that requires kernel mode: allocating memory from the OS (e.g., when malloc grows the heap) or performing I/O and accessing hardware (fread/fwrite, HDD, network, etc.).
  • In Kernel mode, the executing code has complete and unrestricted access to the underlying hardware. It can execute any CPU instruction and reference any memory address. Kernel mode is generally reserved for the lowest-level, most trusted functions of the operating system. Crashes in kernel mode are catastrophic; they will halt the entire PC.
  • The mode bit is set to 0 in the kernel mode. It is changed from 0 to 1 when switching from kernel mode to user mode.
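To see these user-to-kernel transitions in practice, for example the write() calls issued for GC logging discussed earlier, you can trace the JVM's system calls (assuming Linux with strace installed; <JVM_PID> is a placeholder, and -T prints the time spent in each call, so slow GC-log writes show up directly):

strace -f -T -e trace=write -p <JVM_PID>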
GC Attacks by Linux

In a Linux environment, it may happen that an IMDG-based (in-memory data grid) application faces long GC pauses due to I/O starvation, memory starvation, or other kernel-specific settings.

[Times: user=0.17 sys=0.00, real=3.18 secs]
[Times: user=0.00 sys=127.31, real=126.10 secs]

GC stopped the world for minutes but:
  • Did no real work (CPU time in user mode = 0)
  • Burned cycles in the Linux kernel
1. IO Starvation :

Symptom : GC log shows "low user time, low system time , long GC pause"

Cause : GC threads stuck in kernel waiting for IO, usually due to journal commits or FS flush of changes by gzip of log rolling

Solutions :
  • Strategy : Even out workload to disk drives (flush every 5s rather than 30s)
sysctl -w vm.dirty_writeback_centisecs=500
sysctl -w vm.dirty_expire_centisecs=500
  • In Progress : Direct IO with gzip or gzip as-you-go
2. Memory Starvation :

Symptom : GC log shows "low user time, high system time , long GC pause"

Cause : Memory pressure triggers swapping or scanning for free memory

Solutions :
  • Strategy : Pre-allocate memory to JVM heap and protect it against swapping or scanning
  • Turn on the -XX:+AlwaysPreTouch option in the JVM
sysctl -w vm.swappiness=0
        This protects the heap and anonymous memory.
  • JVM startup then has a ~2 second delay to allocate all the memory (17 GB in this case)
Modify kernel settings in order to overcome long GC pauses caused by the Linux kernel:


1. I/O issues :

If the GC log shows "low user time, low system time, long GC pause", the reason could be GC threads stuck in the kernel waiting for I/O. Basically, this happens due to journal commits or a file system flush of changes caused by gzip during log rolling.
As a solution, you can increase the frequency of page flushing to disk from the default 30 seconds to 5 seconds:
  sysctl -w vm.dirty_writeback_centisecs=500
  sysctl -w vm.dirty_expire_centisecs=500

2. Memory issues :

If GC log shows “low user time, high system time, long GC pause” then most likely memory pressure triggers swapping or scanning for free memory.
  • Check and decrease 'swappiness' setting to protect heap and anonymous memory.
  sysctl -w vm.swappiness=10
  • Add -XX:+AlwaysPreTouch to JVM settings on startup
  • Turn off NUMA zone-reclaim optimization
sysctl -w vm.zone_reclaim_mode=0
  • Turn off Transparent Huge Pages if RedHat distribution is used
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag


3. Page Cache :
  • In cases where an application interacts heavily with the underlying file system, RAM can end up highly utilized by the page cache.
  • If the kswapd daemon doesn't keep up with reclaiming the pages used by the page cache in the background, the application can face high latencies due to direct reclaim when it needs a new page.
  • This situation can affect not only the performance of the application but may also lead to long GC pauses.
To get over long GC pauses caused by direct page reclaim on Linux, on kernels that support it you can add extra headroom between wmark_min and wmark_low with the /proc/sys/vm/extra_free_kbytes setting, trying to avoid the aforementioned latencies.
sysctl -w vm.extra_free_kbytes=1240000

Direct memory reclaim by the kernel is a major source of slowness behind long GC pauses:



4. File Descriptors :


4(a). System File Descriptor Limit :


  • When running a large number of threads accessing the grid, as in the case of large-scale server-side applications, you may end up with a large number of open files on both client and server nodes. It is recommended that you increase the default values to the maximum defaults.
  • Misconfiguring the file descriptor settings will impact application stability and performance. To address this, set both the "system-level file descriptor limit" and the "process-level file descriptor limit" by following these steps as the root user:
  • Modify the following line in the /etc/sysctl.conf file:
fs.file-max = 300000
  • Apply the change by executing the following command:
/sbin/sysctl -p

Verify your settings using:

cat /proc/sys/fs/file-max

Alternatively, you may execute the following command:

sysctl fs.file-max


4(b). Process File Descriptor Limit:


  • By default, the Linux OS has a relatively small number of file descriptors (open files) and a small max user processes limit (1024) configured.
  • It is important that you use a user account which has its maximum open file descriptors (open files) and max user processes configured to an appropriate value.
A good maximum value for open file descriptors is 32768.

Use the following command to set the maximum open file descriptors and maximum user processes:

ulimit -n 32768 -u 32768

Alternatively, you may modify the following files accordingly:

/etc/security/limits.conf

* soft    nofile          32768
* hard    nofile          32768

/etc/security/limits.d/90-nproc.conf

* soft nproc 32768
