Java App GC 性能分析和异常处理

JVM 发展到今天，已经相当成熟。如果我们使用 G1作为垃圾回收方案，则配置上更是轻松很多，除了暂停时间和 xms、xmx，其他几乎都不用管。

当然，这是理想情况。实际工作中，无论是开源工具还是企业自己开发的 Application，都会出现性能问题，这些问题，又大多和 GC 有关。本文总结一下常见的一些 GC 异常场景和它们的处理方法。

目的

GC 优化的目的，是让程序达到尽可能高的 throughput（吞吐量）：即 JVM 在处理业务的时间和在维护自身的时间（主要是 gc）的比例。通常这个数字应该在95%以上。

如果你发现 Java 程序的吞吐量低，可以从以下4个原因着手分析：

内存泄漏
长时间的 gc pause
连续的 full GC
等待 I/O 、CPU 等

处理方案

内存泄漏

首先我们看一下 most notorious 的第一个原因，Memory Leak。主要现象，是 log 中出现的 OutOfMemory 异常。这类 OOM 错误一般有8种。其中，5类在 JVM heap上：

Java heap space
GC overhead limit exceeded
Requested array size exceeds VM limit
Permgen space
Metaspace

另外3类是：

Unable to create new native thread
Kill process or sacrifice child.
reason stack_trace_with_native_method

分析这8类问题的形成机制，已经有珠玉在前，这里不再赘述。这类问题的具体分析，依赖 heap dump工具，MAT(Eclipse Memory Analyzer) 就是一个比较理想的工具。

Long GC pause

导致long gc pause 的主要原因有：

创建对象太多太快
young gen 的 size 太小
GC 算法选择不当
系统发生了 swapping
GC thread 不够
后台 IO 太多
显式调用了 System.gc()，这会导致 STW（stop the world）
Heap size 设置的过大

接下来，我们依次讲解一下这些问题的处理方式。

创建对象太快，可以通过 Heap dump 来发现，解决方案需要具体分析，通常是优化程序，或者增加 app worker 个数。

Young gen 的 size 太小，可以通过修改以下两个参数来调整：

-Xmn：直接调整 Young generation 的大小
-XX:NewRatio：这个参数调整 old/young 的比例。如果你用 G1，不要设置

GC 算法选择的问题比较复杂。总的来说建议在大内存的服务器（超过48g）、JDK 8以上的环境，使用G1。G1只需要设置这个参数：-XX:MaxGCPauseMillis。

Swap 的问题是一个很常见的问题。由于需要的页被换出内存，导致未命中的情况，会产生大量的磁盘 IO，严重降低 GC 速度。因此在吞吐量要求高的系统上，需要禁用 swap。

dstat工具、或者更常见的free -h命令都能够用来检测 swapping。如果你想知道具体哪些进程发生了 swapping，可以用下面这个脚本：

#!/bin/bash
# Get current swap usage for all running processes
# Erik Ljungstrom 27/05/2011
# Modified by Mikko Rantalainen 2012-08-09
# Pipe the output to "sort -nk3" to get sorted output
# Modified by Marc Methot 2014-09-18
# removed the need for sudo

SUM=0
OVERALL=0
for DIR in `find /proc/ -maxdepth 1 -type d -regex "^/proc/[0-9]+"`
do
    PID=`echo $DIR | cut -d / -f 3`
    PROGNAME=`ps -p $PID -o comm --no-headers`
    for SWAP in `grep VmSwap $DIR/status 2>/dev/null | awk '{ print $2 }'`
    do
        let SUM=$SUM+$SWAP
    done
    if (( $SUM > 0 )); then
        echo "PID=$PID swapped $SUM KB ($PROGNAME)"
    fi
    let OVERALL=$OVERALL+$SUM
    SUM=0
done
echo "Overall swap used: $OVERALL KB"

一般来说，对于一个专用的 Java server，关闭 swapping 是最理想的选择。如果实在无法做到，可以通过增加内存、降低 heap size、关闭其他不相干进程等方式，缓解 swapping 的问题。

GC thread 不够的问题，不太常见。因为通常服务器的 CPU 都是多核的，分配多核给 GC thread 也是很普遍的行为。GC thread 不够，可以从 GC log 中看出来，例如：

[Times: user=25.56 sys=0.35, real=20.48 secs]

这里，real 时间，是 wall clock，即真实消耗的时间。user/sys，分别表示在用户态和核心态消耗的时间。如果有多个线程同时工作，时间会累加起来。如果我们有5个 GC thread，那么 user 应该差不多等于 real 的 5倍。如果 real 时间比较大，而 user 比 real 大的倍数不多，那么我们就需要更多的 GC thread 了。

后台 IO 过多的现象同样可以通过 dstat 等工具发现，还有个比较实用的技巧就是和上面的问题一样，看 GC Times。如果 real 比 user 多，那么毫无疑问很多时间被用在了 IO 上。现在的 server 一般多采用异步模式，这个问题出现的概率应该不高。

System.gc() 的调用，也会导致 STW。该调用的来源可能有以下几种：

显式调用
第三方库
RMI
JMX

可以通过 -XX:+DisableExplicitGC 来阻止程序显式调用 GC。

最后，如果 Heap Size 过大也会影响 GC 速度。但是这点我不是很确定。理论上大的 Heap 会降低 GC 的频率，影响到底有多大，需要具体分析。

连续 Full GC

Full GC 占用的系统资源很高。它会清理 Heap 中所有的 gen（Young, Old, Perm, Metaspace）。在 Full GC 中很多步骤是 STW 的，如 initial-mark，remark， cleanup。在这个过程中，业务代码停止运行，JVM 用全部的 CPU 来执行 GC，同时也会导致 CPU 占用率飙升。总的来说，Full GC 应当避免，连续 Full GC 更应该避免。

连续 Full GC 的原因，有以下几类：

并发模式失败
G1启动了标记周期，但在Mix GC之前，老年代就被填满，这时候G1会放弃标记周期。
promotion 失败或者 evacuation 失败
在进行 GC 的时候，没有足够的内存供存活对象或晋升对象使用，由此触发了Full GC。日志中通常会出现以下字样："evacuation failure", "to-space exhausted", "to-space overflow"。如果你是从 CMS 之类的 GC 切换到 G1，记得把分配 Heap 比例的几个选项关闭。另外，有几个选项对这个现象有一定影响，如 -XX：InitiatingHeapOccupancyPercent 和 -XX:G1ReservePercent。如果你试图改一改这个，确保你先看过这篇文章。这里我把原文关键部分摘录，供大家参考。
1. Find out if the failures are a side effect of over-tuning - Get a simple baseline with min and max heap and a realistic pause time goal: Remove any additional heap sizing such as -Xmn, -XX:NewSize, -XX:MaxNewSize, -XX:SurvivorRatio, etc. Use only -Xms, -Xmx and a pause time goal -XX:MaxGCPauseMillis.
1. If the problem persists even with the baseline run and if humongous allocations (see next section below) are not the issue - the corrective action is to increase your Java heap size, if you can, of course
1. If increasing the heap size is not an option and if you notice that the marking cycle is not starting early enough for G1 GC to be able to reclaim the old generation then drop your -XX:InitiatingHeapOccupancyPercent. The default for this is 45% of your total Java heap. Dropping the value will help start the marking cycle earlier. Conversely, if the marking cycle is starting early and not reclaiming much, you should increase the threshold above the default value to make sure that you are accommodating for the live data set for your application.
1. If concurrent marking cycles are starting on time, but are taking a lot of time to finish; and hence are delaying the mixed garbage collection cycles which will eventually lead to an evacuation failure since old generation is not timely reclaimed; increase the number of concurrent marking threads using the command line option: -XX:ConcGCThreads.
1. If "to-space" survivor is the issue, then increase the -XX:G1ReservePercent. The default is 10% of the Java heap. G1 GC creates a false ceiling and reserves the memory, in case there is a need for more "to-space". Of course, G1 GC caps it off at 50%, since we do not want the end-user to set it to a very large value.
巨型对象分配失败
这个问题我们并没有遇到过，但是如果你遇到了，日志里会有以下字样：

1361.680: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 1459617792 bytes, allocation request: 4194320 bytes, threshold: 1449551430 bytes (45.00 %), source: concurrent humongous allocation]

可以通过增加 Heap size 或者增大 -XX:G1HeapRegionSize 来解决。

总之，Full GC 根源还是 JVM heap 大小分配的不够，意味着 JVM 需要更多 heap 空间。处理方法：

增加 Heap size
增加 perm gen/metaspace size。
更多的机器！

除了上述的常规处理方案，我们也需要考虑程序 bug 的情况。举例来说，我们的 Cassandra 集群就经常遇到以下的连续 Full GC：

2018-11-30T16:28:57.196+0800: 269569.472: [Full GC (Allocation Failure)  23G->22G(24G), 80.8007295 secs]
2018-11-30T16:30:22.152+0800: 269654.428: [Full GC (Allocation Failure)  23G->22G(24G), 83.1106023 secs]
2018-11-30T16:31:48.893+0800: 269741.169: [Full GC (Allocation Failure)  23G->22G(24G), 80.5397583 secs]
2018-11-30T16:33:13.391+0800: 269825.666: [Full GC (Allocation Failure)  23G->22G(24G), 83.3309248 secs]
2018-11-30T16:34:40.599+0800: 269912.874: [Full GC (Allocation Failure)  23G->22G(24G), 80.2643310 secs]

虽然官方没有确认，但我觉得这是一个 bug：https://issues.apache.org/jira/browse/CASSANDRA-13365

总结

G1 的优化，就是尽量不要设置多余的参数，让它自己处理。如果你实在忍不住要动，可以参考以下文章：

最后，GC 的分析，有一个很好用的在线工具，gceasy.io，上传 GC log 文件就可以进行在线智能分析。本文也大量参考了 gceasy 的文档。

Java App GC 性能分析和异常处理