问题描述
线上支付元数据系统出现告警,堆内存使用率超过95%
应用背景
应用说明:该系统存储了很多账户-银行的配置信息,供业务系统查询使用,业务系统调用是全量获取的,所以会产生些大对象(将list转化为json返回)
JVM参数如下:
-Xms6144m -Xmx6144m -XX:+UseG1GC -XX:MaxGCPauseMillis=500
-Xloggc:/home/finance/Logs/xxx/xxx-gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=50M
GC日志如下:
2020-09-16T19:29:36.138+0800: 1113411.729: [GC pause (G1 Humongous Allocation) (young) (initial-mark), 0.0486270 secs]
[Parallel Time: 44.3 ms, GC Workers: 4]
[GC Worker Start (ms): Min: 1113411731.0, Avg: 1113411731.0, Max: 1113411731.0, Diff: 0.1]
[Ext Root Scanning (ms): Min: 3.8, Avg: 4.2, Max: 4.5, Diff: 0.6, Sum: 16.6]
[Update RS (ms): Min: 34.6, Avg: 34.7, Max: 34.7, Diff: 0.1, Sum: 138.7]
[Processed Buffers: Min: 184, Avg: 205.5, Max: 229, Diff: 45, Sum: 822]
[Scan RS (ms): Min: 1.0, Avg: 1.1, Max: 1.2, Diff: 0.2, Sum: 4.5]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Object Copy (ms): Min: 3.6, Avg: 4.0, Max: 4.3, Diff: 0.7, Sum: 15.9]
[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Termination Attempts: Min: 1, Avg: 1.2, Max: 2, Diff: 1, Sum: 5]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[GC Worker Total (ms): Min: 43.9, Avg: 44.0, Max: 44.0, Diff: 0.1, Sum: 175.8]
[GC Worker End (ms): Min: 1113411775.0, Avg: 1113411775.0, Max: 1113411775.0, Diff: 0.0]
[Code Root Fixup: 0.0 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.1 ms]
[Other: 4.3 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 0.2 ms]
[Ref Enq: 0.0 ms]
[Redirty Cards: 0.1 ms]
[Humongous Register: 0.1 ms]
[Humongous Reclaim: 1.2 ms]
[Free CSet: 0.2 ms]
[Eden: 14.0M(3682.0M)->0.0B(3680.0M) Survivors: 4096.0K->6144.0K Heap: 2634.4M(6144.0M)->1491.6M(6144.0M)]
[Times: user=0.18 sys=0.00, real=0.04 secs]
2020-09-16T19:29:36.187+0800: 1113411.778: [GC concurrent-root-region-scan-start]
2020-09-16T19:29:36.203+0800: 1113411.793: [GC concurrent-root-region-scan-end, 0.0154486 secs]
2020-09-16T19:29:36.203+0800: 1113411.793: [GC concurrent-mark-start]
2020-09-16T19:29:36.637+0800: 1113412.227: [GC concurrent-mark-end, 0.4340025 secs]
2020-09-16T19:29:36.651+0800: 1113412.241: [GC remark 2020-09-16T19:29:36.651+0800: 1113412.241: [Finalize Marking, 0.0014060 secs] 2020-09-16T19:29:366
.652+0800: 1113412.243: [GC ref-proc, 0.0011151 secs] 2020-09-16T19:29:36.653+0800: 1113412.244: [Unloading, 0.0640499 secs], 0.0700085 secs]
[Times: user=0.24 sys=0.00, real=0.07 secs]
2020-09-16T19:29:36.723+0800: 1113412.313: [GC cleanup 2085M->1645M(6144M), 0.0041280 secs]
[Times: user=0.02 sys=0.00, real=0.00 secs]
2020-09-16T19:29:36.727+0800: 1113412.318: [GC concurrent-cleanup-start]
2020-09-16T19:29:36.728+0800: 1113412.318: [GC concurrent-cleanup-end, 0.0003667 secs]
2020-09-16T21:34:35.527+0800: 1120911.117: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 1.2081112 secs]
[Parallel Time: 226.4 ms, GC Workers: 4]
[GC Worker Start (ms): Min: 1120911125.5, Avg: 1120911125.6, Max: 1120911125.7, Diff: 0.1]
[Ext Root Scanning (ms): Min: 4.0, Avg: 4.6, Max: 5.7, Diff: 1.6, Sum: 18.4]
[Update RS (ms): Min: 23.9, Avg: 24.0, Max: 24.2, Diff: 0.3, Sum: 96.1]
[Processed Buffers: Min: 98, Avg: 109.5, Max: 119, Diff: 21, Sum: 438]
[Scan RS (ms): Min: 2.4, Avg: 2.5, Max: 2.6, Diff: 0.2, Sum: 10.1]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.3]
[Object Copy (ms): Min: 190.3, Avg: 193.9, Max: 195.4, Diff: 5.1, Sum: 775.5]
[Termination (ms): Min: 0.0, Avg: 0.9, Max: 3.7, Diff: 3.7, Sum: 3.8]
[Termination Attempts: Min: 1, Avg: 1.2, Max: 2, Diff: 1, Sum: 5]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.2]
[GC Worker Total (ms): Min: 226.0, Avg: 226.1, Max: 226.3, Diff: 0.3, Sum: 904.3]
[GC Worker End (ms): Min: 1120911351.6, Avg: 1120911351.7, Max: 1120911351.8, Diff: 0.2]
[Code Root Fixup: 0.0 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.8 ms]
[Other: 980.8 ms]
[Evacuation Failure: 968.8 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 0.2 ms]
[Ref Enq: 0.0 ms]
[Redirty Cards: 0.2 ms]
[Humongous Register: 7.7 ms]
[Humongous Reclaim: 2.2 ms]
[Free CSet: 1.2 ms]
[Eden: 2942.0M(3680.0M)->0.0B(306.0M) Survivors: 6144.0K->0.0B Heap: 5889.8M(6144.0M)->2864.0M(6144.0M)]
[Times: user=4.17 sys=0.03, real=1.21 secs]
2020-09-16T21:56:31.801+0800: 1122227.392: [GC pause (G1 Evacuation Pause) (mixed), 0.0330733 secs]
[Parallel Time: 29.8 ms, GC Workers: 4]
[GC Worker Start (ms): Min: 1122227392.3, Avg: 1122227392.4, Max: 1122227392.4, Diff: 0.1]
[Ext Root Scanning (ms): Min: 3.4, Avg: 3.8, Max: 4.5, Diff: 1.2, Sum: 15.2]
[Update RS (ms): Min: 21.0, Avg: 21.4, Max: 21.5, Diff: 0.5, Sum: 85.6]
[Processed Buffers: Min: 35, Avg: 50.8, Max: 59, Diff: 24, Sum: 203]
[Scan RS (ms): Min: 1.0, Avg: 1.2, Max: 1.5, Diff: 0.5, Sum: 4.9]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Object Copy (ms): Min: 2.6, Avg: 3.2, Max: 3.7, Diff: 1.1, Sum: 12.7]
[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 4]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[GC Worker Total (ms): Min: 29.6, Avg: 29.6, Max: 29.7, Diff: 0.1, Sum: 118.5]
[GC Worker End (ms): Min: 1122227422.0, Avg: 1122227422.0, Max: 1122227422.0, Diff: 0.0]
[Code Root Fixup: 0.0 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.3 ms]
[Other: 3.0 ms]
[Choose CSet: 0.1 ms]
[Ref Proc: 1.4 ms]
[Ref Enq: 0.0 ms]
[Redirty Cards: 0.2 ms]
[Humongous Register: 0.1 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 0.7 ms]
[Eden: 306.0M(306.0M)->0.0B(2744.0M) Survivors: 0.0B->2048.0K Heap: 3170.0M(6144.0M)->2602.0M(6144.0M)]
[Times: user=0.11 sys=0.01, real=0.03 secs]
2020-09-16T21:34:35.527+0800: 1120911.117: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 1.2081112 secs]
这个暂停时间达到1.2s之长
巨型对象过多导致的内存碎片,引起G1 to-space exhausted
根据上述的GC日志,可以发现几个关键字: to-space exhausted, G1 Humongous Allocation。
查找相关的资料,如:
https://docs.oracle.com/en/java/javase/11/gctuning/garbage-first-garbage-collector-tuning.html#GUID-2428DA90-B93D-48E6-B336-A849ADF1C552
文章中描述了巨型对象是如何定义的,只要对象大小>=1/2 region,那么这个对象就会被JVM标记为Humongous object
Humongous objects are objects larger or equal the size of half a region. The current region size is determined ergonomically as described in the Ergonomic Defaults for G1 GC section, unless set using the
-XX:G1HeapRegionSize
option.
每一个巨型对象是一组连续的region,分配在老年代。 如:如果region为1m,对象大小为600KB,那么剩余的空间(1024kb - 600kb)将成为碎片,不在参与分配。
Every humongous object gets allocated as a sequence of contiguous regions in the old generation. The start of the object itself is always located at the start of the first region in that sequence. Any leftover space in the last region of the sequence will be lost for allocation until the entire object is reclaimed.
G1的巨型对象内存碎片的验证
package GC;
import java.util.ArrayList;
import java.util.List;
/**
* -verbose:gc -Xms56m -Xmx56m -XX:G1HeapRegionSize=1M -XX:+UseG1GC -Xloggc:e:\g1\g1-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+HeapDumpBeforeFullGC -XX:HeapDumpPath=e:\g1\fullgc.dump
* Created by yongzheng.yang on 2017/7/20.
*/
public class G1GCTest {
static List<byte[]> list = new ArrayList<>();
public static void main(String[] args) throws InterruptedException {
// 10M内存Eden区容不下, S区也容不下, 直接进入Old区, 但是还不够30*50%=15M CMS GC触发条件
for (int i = 0; i < 56; i++) {
/*Heap
garbage-first heap total 57344K, used 31931K [0x00000000fc800000, 0x00000000fc9001c0, 0x0000000100000000)
region size 1024K, 2 young (2048K), 1 survivors (1024K)
*/
// region为1m 1024k,分配的内存为 612K >= 1/2m 所以该对象是Humongous-obj,占用1个region
// 当分配到53个后,老年代占53个区(Humongous Obj) , Eden 占1个,Survivor占2个,所以53以后没有空间在继续分配大对象了
// 就会报
byte[] byte10m1 = new byte[1 * 612 * 1024];
list.add(byte10m1); // 一直持有,不让回收
}
// 分配1kb 但此次内存已经使用完毕了,会触发一次java.lang.OutOfMemoryError: Java heap space
// 如果配置-XX:G1HeapRegionSize=2 指定每个region为2M,则此处不会抛异常
byte[] byte10m2 = new byte[1 * 1024];
Thread.sleep(30000);
}
}
执行到第52次的时候
C:\Program Files\Java\jdk1.8.0_172\bin>jmap -heap 22784
Attaching to process ID 22784, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.172-b11
using thread-local object allocation.
Garbage-First (G1) GC with 4 thread(s)
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 58720256 (56.0MB)
NewSize = 1363144 (1.2999954223632812MB)
MaxNewSize = 34603008 (33.0MB)
OldSize = 5452592 (5.1999969482421875MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 1048576 (1.0MB)
Heap Usage:
G1 Heap:
regions = 56
capacity = 58720256 (56.0MB)
used = 33761088 (32.19708251953125MB)
free = 24959168 (23.80291748046875MB)
57.49479021344866% used
G1 Young Generation:
Eden Space:
regions = 1
capacity = 11534336 (11.0MB)
used = 1048576 (1.0MB)
free = 10485760 (10.0MB)
9.090909090909092% used
Survivor Space:
regions = 2
capacity = 2097152 (2.0MB)
used = 2097152 (2.0MB)
free = 0 (0.0MB)
100.0% used
G1 Old Generation:
regions = 52
capacity = 45088768 (43.0MB)
used = 30615360 (29.19708251953125MB)
free = 14473408 (13.80291748046875MB)
67.90019190588663% used
1727 interned Strings occupying 155576 bytes.
第53次分配的时候,可以看到G1 OLD区的regions变为了53,说明new byte[1 * 612 * 1024]的分配对于1M的regions来说就是一个Humongou Obj 占用一个独立的region (1M),此时所有regions均与分配完毕,如果在来下一次分配会怎么样呢?
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 58720256 (56.0MB)
NewSize = 1363144 (1.2999954223632812MB)
MaxNewSize = 34603008 (33.0MB)
OldSize = 5452592 (5.1999969482421875MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 1048576 (1.0MB)
Heap Usage:
G1 Heap:
regions = 56
capacity = 58720256 (56.0MB)
used = 33787552 (32.222320556640625MB)
free = 24932704 (23.777679443359375MB)
57.53985813685826% used
G1 Young Generation:
Eden Space:
regions = 1
capacity = 11534336 (11.0MB)
used = 1048576 (1.0MB)
free = 10485760 (10.0MB)
9.090909090909092% used
Survivor Space:
regions = 2
capacity = 2097152 (2.0MB)
used = 2097152 (2.0MB)
free = 0 (0.0MB)
100.0% used
G1 Old Generation:
regions = 52
capacity = 45088768 (43.0MB)
used = 30641824 (29.222320556640625MB)
free = 14446944 (13.777679443359375MB)
67.95888501544331% used
1727 interned Strings occupying 155576 bytes.
第54次分配,heap分布情况如下,可以发现eden, suvivor的regions全变为0,这两个space的对象全部移到了old区(old区regions变为56)
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 58720256 (56.0MB)
NewSize = 1363144 (1.2999954223632812MB)
MaxNewSize = 34603008 (33.0MB)
OldSize = 5452592 (5.1999969482421875MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 1048576 (1.0MB)
Heap Usage:
G1 Heap:
regions = 56
capacity = 58720256 (56.0MB)
used = 35100520 (33.474464416503906MB)
free = 23619736 (22.525535583496094MB)
59.77582931518555% used
G1 Young Generation:
Eden Space:
regions = 0
capacity = 3145728 (3.0MB)
used = 0 (0.0MB)
free = 3145728 (3.0MB)
0.0% used
Survivor Space:
regions = 0
capacity = 0 (0.0MB)
used = 0 (0.0MB)
free = 0 (0.0MB)
0.0% used
G1 Old Generation:
regions = 56
capacity = 55574528 (53.0MB)
used = 35100520 (33.474464416503906MB)
free = 20474008 (19.525535583496094MB)
63.15936682359228% used
观察GC日志,发生to-space exhausted的收集,查看了https://www.infoq.com/articles/tuning-tips-G1-GC/
这一步骤是非常耗时的,所以我们需要尽量避免由于空间不足引起的to-space exhausted
For G1 GC, an evacuation failure is very expensive -
For successfully copied objects, G1 needs to update the references and the regions have to be tenured.
For unsuccessfully copied objects, G1 will self-forward them and tenure the regions in place.
2020-10-22T11:26:25.291+0800: 227.567: [GC pause (G1 Humongous Allocation) (young) (to-space exhausted), 0.0087451 secs]
[Parallel Time: 6.8 ms, GC Workers: 4]
[GC Worker Start (ms): Min: 227566.9, Avg: 227566.9, Max: 227567.0, Diff: 0.1]
[Ext Root Scanning (ms): Min: 0.2, Avg: 1.8, Max: 6.6, Diff: 6.5, Sum: 7.2]
[Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Processed Buffers: Min: 0, Avg: 0.0, Max: 0, Diff: 0, Sum: 0]
[Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.1]
[Object Copy (ms): Min: 0.0, Avg: 3.1, Max: 4.3, Diff: 4.3, Sum: 12.4]
[Termination (ms): Min: 0.0, Avg: 1.7, Max: 2.4, Diff: 2.4, Sum: 6.8]
[Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 4]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[GC Worker Total (ms): Min: 6.6, Avg: 6.7, Max: 6.7, Diff: 0.1, Sum: 26.6]
[GC Worker End (ms): Min: 227573.6, Avg: 227573.6, Max: 227573.6, Diff: 0.1]
[Code Root Fixup: 0.0 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.1 ms]
[Other: 1.8 ms]
[Evacuation Failure: 1.4 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 0.2 ms]
[Ref Enq: 0.0 ms]
[Redirty Cards: 0.1 ms]
[Humongous Register: 0.0 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 0.0 ms]
[Eden: 1024.0K(10.0M)->0.0B(2048.0K) Survivors: 2048.0K->0.0B Heap: 33.0M(56.0M)->32.9M(56.0M)]
[Times: user=0.06 sys=0.00, real=0.01 secs]
解决线上的问题
结合我们业务的场景及GC日志中的巨型对象分配触发的收集,以及发生to-space exhausted时,GC日志的空间实际上还是充足的,说明是巨型对象Humongou Obj分配过多引起的内存碎片, 最终导致to-space exhausted,所以决定调整线上的region size大小,增大为4m(原先并无配置,根据6G内存推导计算应该是2M),经过一段时间上线运行,不在出现to-space exhausted导致的GC暂停过长
-XX:G1HeapRegionSize=4M
但,对于应用程序来说,最好的做法还是避免Humongou Obj的产生,应该避免大对象的加载。