背景
线上有个流量比较大的服务,qps日常在2000-3000之间,请求方对耗时要求比较高,响应时间要求在300ms以内,服务上线以后,运行也比较平稳。某天夜里,突然有几千个dubbo请求超时,找了几个请求查看调用链路发现超时的地方各不相同,服务依赖的接口方均表示服务没有异常。
这种情况下,一般就是服务自身的问题了,仔细看了下dubbo的超时,发现超时的都是同一个provider,于是看了下这个出问题的provider,看了下这台机器的gc情况,发现超时时有一次full gc,gc耗时达到了1s以上。
full gc时间过长问题排查
gc日志查看
发现full gc时间过长以后,就去对应的机器上把gc log下载了下来,对gc日志进行分析。
36402.792: [GC (Allocation Failure) 36402.792: [ParNew: 2105440K->7584K(2359296K), 0.0124564 secs] 4517143K->2419327K(4980736K), 0.0129547 secs] [Times: user=0.13 sys=0.00, real=0.01 secs]
36402.810: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2411743K(2621440K)] 2419601K(4980736K), 0.0037437 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
36402.814: [CMS-concurrent-mark-start]
36403.622: [CMS-concurrent-mark: 0.800/0.808 secs] [Times: user=5.73 sys=0.26, real=0.81 secs]
36403.623: [CMS-concurrent-preclean-start]
36403.801: [CMS-concurrent-preclean: 0.177/0.178 secs] [Times: user=0.69 sys=0.05, real=0.18 secs]
36403.801: [CMS-concurrent-abortable-preclean-start]
36403.998: [GC (Allocation Failure) 36403.998: [ParNew: 2104736K->7273K(2359296K), 0.0122722 secs] 4516479K->2419078K(4980736K), 0.0127518 secs] [Times: user=0.13 sys=0.01, real=0.02 secs]
36404.672: [CMS-concurrent-abortable-preclean: 0.826/0.870 secs] [Times: user=3.39 sys=0.26, real=0.87 secs]
36404.677: [GC (CMS Final Remark) [YG occupancy: 1097431 K (2359296 K)]36404.677: [GC (CMS Final Remark) 36404.678: [ParNew: 1097431K->7946K(2359296K), 0.0146413 secs] 3509235K->2419797K(4980736K), 0.0151127 secs] [Times: user=0.17 sys=0.00, real=0.02 secs]
36404.693: [Rescan (parallel) , 0.0044661 secs]36404.697: [weak refs processing, 1.6796757 secs]36406.377: [class unloading, 0.1835045 secs]36406.560: [scrub symbol table, 0.0291043 secs]36406.590: [scrub string table, 0.0029833 secs][1 CMS-remark: 2411851K(2621440K)] 2419797K(4980736K), 2.1253433 secs] [Times: user=2.33 sys=0.00, real=2.13 secs]
36406.803: [CMS-concurrent-sweep-start]
36409.458: [GC (Allocation Failure) 36409.458: [ParNew: 2105098K->53837K(2359296K), 0.0675226 secs] 4328721K->2277488K(4980736K), 0.0681548 secs] [Times: user=0.86 sys=0.00, real=0.07 secs]
36410.035: [GC (Allocation Failure) 36410.035: [ParNew: 2150989K->63917K(2359296K), 0.0566114 secs] 4128382K->2041367K(4980736K), 0.0572665 secs] [Times: user=0.71 sys=0.00, real=0.06 secs]
36410.966: [GC (Allocation Failure) 36410.966: [ParNew: 2161069K->51123K(2359296K), 0.0144844 secs] 3640286K->1530366K(4980736K), 0.0149783 secs] [Times: user=0.16 sys=0.00, real=0.02 secs]
36411.552: [CMS-concurrent-sweep: 2.287/4.749 secs] [Times: user=23.03 sys=1.41, real=4.75 secs]
36411.552: [CMS-concurrent-reset-start]
36411.558: [CMS-concurrent-reset: 0.005/0.005 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
gc日志里面各项代表的意义可以看这篇文章:https://www.jianshu.com/p/a5b3905bf225
我们的老年代收集算法用的是cms算法,现在主要看一下里面会stw的两个阶段耗时,Initial Mark阶段耗时real是0.01秒,耗时很低。然后就是重新标记Final Remark阶段,耗时达到了2.13s,这里面看具体的耗时,weak refs processing, 1.6796757 secs这个占用了比较久的时间,那么这个弱引用处理,究竟在处理什么呢,就要看具体dump出来的堆了。
weak refs processing处理逻辑
weak refs processing的源码如下:
void ReferenceProcessor::process_discovered_references(
BoolObjectClosure* is_alive,
OopClosure* keep_alive,
VoidClosure* complete_gc,
AbstractRefProcTaskExecutor* task_executor) {
NOT_PRODUCT(verify_ok_to_handle_reflists());
assert(!enqueuing_is_done(), "If here enqueuing should not be complete");
// Stop treating discovered references specially.
disable_discovery();
bool trace_time = PrintGCDetails && PrintReferenceGC;
// Soft references
{
TraceTime tt("SoftReference", trace_time, false, gclog_or_tty);
process_discovered_reflist(_discoveredSoftRefs, _current_soft_ref_policy, true,
is_alive, keep_alive, complete_gc, task_executor);
}
update_soft_ref_master_clock();
// Weak references
{
TraceTime tt("WeakReference", trace_time, false, gclog_or_tty);
process_discovered_reflist(_discoveredWeakRefs, NULL, true,
is_alive, keep_alive, complete_gc, task_executor);
}
// Final references
{
TraceTime tt("FinalReference", trace_time, false, gclog_or_tty);
process_discovered_reflist(_discoveredFinalRefs, NULL, false,
is_alive, keep_alive, complete_gc, task_executor);
}
// Phantom references
{
TraceTime tt("PhantomReference", trace_time, false, gclog_or_tty);
process_discovered_reflist(_discoveredPhantomRefs, NULL, false,
is_alive, keep_alive, complete_gc, task_executor);
}
// Weak global JNI references. It would make more sense (semantically) to
// traverse these simultaneously with the regular weak references above, but
// that is not how the JDK1.2 specification is. See #4126360. Native code can
// thus use JNI weak references to circumvent the phantom references and
// resurrect a "post-mortem" object.
{
TraceTime tt("JNI Weak Reference", trace_time, false, gclog_or_tty);
if (task_executor != NULL) {
task_executor->set_single_threaded_mode();
}
process_phaseJNI(is_alive, keep_alive, complete_gc);
}
}
看JVM源码,weak refs processing主要包括SoftReference、WeakReference、FinalReference、PhantomReference以及JNI Weak Reference这五种Reference对象的处理,处理的主要内容是对之前标记的Reference对象重新处理,重新判断是否需要标记(不标记就是要回收的),如果不标记就需要放到refqueue里,等待java ReferenceHandler线程处理。
所以从工作的主要内容看各种Reference的处理时间跟reference的个数成正比。
dump出堆进行分析
怎么查看堆内都有哪些对象呢
初步定位
简单查看对象数目可以使用下面命令,初步定位问题。
// 这里面7代表要查看进程id
jmap -histo:live 7 > dump_7
这个dump_7中就会有堆中对象的个数和占用空间大小信息,这个由于我没有线上机权限,并没有执行。
dump出堆进行分析
这里使用如下命令可以dump出堆,这个操作会影响生产环境应用,慎用
jmap -dump:format=b,file=headdump_7.hprof 7
dump出来堆文件后,将这个文件用mat打开,可以看到内存泄漏暴露大概的地方:
里面的内容复制出来如下:
The class "com.mysql.cj.jdbc.AbandonedConnectionCleanupThread", loaded by "org.apache.catalina.loader.ParallelWebappClassLoader @ 0x721856020", occupies 119,594,176 (45.21%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node[]", loaded by "<system class loader>", which occupies 119,594,016 (45.21%) bytes.
Keywords:
com.mysql.cj.jdbc.AbandonedConnectionCleanupThread
org.apache.catalina.loader.ParallelWebappClassLoader @ 0x721856020
java.util.concurrent.ConcurrentHashMap$Node[]
查看详情可以看到,connectionFinalizerPhantomRefs占用了其中大部分空间,下一节会分析下AbandonedConnectionCleanupThread类中的connectionFinalizerPhantomRefs:
AbandonedConnectionCleanupThread类分析
AbandonedConnectionCleanupThread类主要变量
AbandonedConnectionCleanupThread类维护了一个单例的单线程池,负责定期清理已回收的connection连接。
public class AbandonedConnectionCleanupThread implements Runnable {
// 存放connection对应的幻象引用集合
private static final Set<ConnectionFinalizerPhantomReference> connectionFinalizerPhantomRefs = ConcurrentHashMap.newKeySet();
// 引用队列,从这个里面取连接出来进行释放
private static final ReferenceQueue<MysqlConnection> referenceQueue = new ReferenceQueue<>();
private static final ExecutorService cleanupThreadExcecutorService;
private static Thread threadRef = null;
private static Lock threadRefLock = new ReentrantLock();
static {
cleanupThreadExcecutorService = Executors.newSingleThreadExecutor(r -> {
Thread t = new Thread(r, "mysql-cj-abandoned-connection-cleanup");
t.setDaemon(true);
// Tie the thread's context ClassLoader to the ClassLoader that loaded the class instead of inheriting the context ClassLoader from the current
// thread, which would happen by default.
// Application servers may use this information if they attempt to shutdown this thread. By leaving the default context ClassLoader this thread
// could end up being shut down even when it is shared by other applications and, being it statically initialized, thus, never restarted again.
ClassLoader classLoader = AbandonedConnectionCleanupThread.class.getClassLoader();
if (classLoader == null) {
// This class was loaded by the Bootstrap ClassLoader, so lets tie the thread's context ClassLoader to the System ClassLoader instead.
classLoader = ClassLoader.getSystemClassLoader();
}
t.setContextClassLoader(classLoader);
return threadRef = t;
});
cleanupThreadExcecutorService.execute(new AbandonedConnectionCleanupThread());
}
}
加入connectionFinalizerPhantomRefs分析
下面看下连接是什么时候加入到connectionFinalizerPhantomRefs的。
// com.mysql.cj.jdbc.AbandonedConnectionCleanupThread
/**
* Tracks the finalization of a {@link MysqlConnection} object and keeps a reference to its {@link NetworkResources} so that they can be later released.
*
* @param conn
* the Connection object to track for finalization
* @param io
* the network resources to close on the connection finalization
*/
protected static void trackConnection(MysqlConnection conn, NetworkResources io) {
threadRefLock.lock();
try {
if (isAlive()) {
ConnectionFinalizerPhantomReference reference = new ConnectionFinalizerPhantomReference(conn, io, referenceQueue);
connectionFinalizerPhantomRefs.add(reference);
}
} finally {
threadRefLock.unlock();
}
}
这个trackConnection方法是在生产连接的时候会被调用,代码如下:
// com.mysql.cj.jdbc.ConnectionImpl
/**
* Creates a connection to a MySQL Server.
*
* @param hostInfo
* the {@link HostInfo} instance that contains the host, user and connections attributes for this connection
* @exception SQLException
* if a database access error occurs
*/
public ConnectionImpl(HostInfo hostInfo) throws SQLException {
try {
// 省去连接生成过程
} catch (CJException e1) {
throw SQLExceptionsMapping.translateException(e1, getExceptionInterceptor());
}
try {
createNewIO(false);
unSafeQueryInterceptors();
// 将连接加入到清理线程中的幻象引用集合中
AbandonedConnectionCleanupThread.trackConnection(this, this.getSession().getNetworkResources());
} catch (SQLException ex) {
cleanup(ex);
// don't clobber SQL exceptions
throw ex;
} catch (Exception ex) {
cleanup(ex);
throw SQLError
.createSQLException(
this.propertySet.getBooleanProperty(PropertyKey.paranoid).getValue() ? Messages.getString("Connection.0")
: Messages.getString("Connection.1",
new Object[] { this.session.getHostInfo().getHost(), this.session.getHostInfo().getPort() }),
MysqlErrorNumbers.SQL_STATE_COMMUNICATION_LINK_FAILURE, ex, getExceptionInterceptor());
}
}
ConnectionFinalizerPhantomReference类
下面看下ConnectionFinalizerPhantomReference连接清理幻象引用的实现,这个类的构造方法中有MysqlConnection 数据库连接对象,NetworkResources 数据库连接资源,ReferenceQueue引用队列,用于在被PhantomReference引用的connection被gc时,JVM会将PhantomReference对象扔到ReferenceQueue中。
/**
* {@link PhantomReference} subclass to track {@link MysqlConnection} objects finalization.
* This class holds a reference to the Connection's {@link NetworkResources} so they it can be later closed.
*/
private static class ConnectionFinalizerPhantomReference extends PhantomReference<MysqlConnection> {
private NetworkResources networkResources;
ConnectionFinalizerPhantomReference(MysqlConnection conn, NetworkResources networkResources, ReferenceQueue<? super MysqlConnection> refQueue) {
super(conn, refQueue);
this.networkResources = networkResources;
}
void finalizeResources() {
if (this.networkResources != null) {
try {
this.networkResources.forceClose();
} finally {
this.networkResources = null;
}
}
}
}
连接关闭的操作
放到引用队列以后,AbandonedConnectionCleanupThread中前面提到的线程池会去执行关闭连接的操作。
public void run() {
for (;;) {
try {
checkThreadContextClassLoader();
Reference<? extends MysqlConnection> reference = referenceQueue.remove(5000);
if (reference != null) {
// 释放连接资源
finalizeResource((ConnectionFinalizerPhantomReference) reference);
}
} catch (InterruptedException e) {
threadRefLock.lock();
try {
threadRef = null;
// Finalize remaining references.
Reference<? extends MysqlConnection> reference;
while ((reference = referenceQueue.poll()) != null) {
finalizeResource((ConnectionFinalizerPhantomReference) reference);
}
connectionFinalizerPhantomRefs.clear();
} finally {
threadRefLock.unlock();
}
return;
} catch (Exception ex) {
// Nowhere to really log this.
}
}
}
finalizeResource实现:
/**
* Release resources from the given {@link ConnectionFinalizerPhantomReference} and remove it from the references set.
*
* @param reference
* the {@link ConnectionFinalizerPhantomReference} to finalize.
*/
private static void finalizeResource(ConnectionFinalizerPhantomReference reference) {
try {
// 调用引用中回收资源的方法
reference.finalizeResources();
reference.clear();
} finally {
// 引用从集合中删除
connectionFinalizerPhantomRefs.remove(reference);
}
}
ConnectionFinalizerPhantomReference#finalizeResources方法实现:
/**
* {@link PhantomReference} subclass to track {@link MysqlConnection} objects finalization.
* This class holds a reference to the Connection's {@link NetworkResources} so they it can be later closed.
*/
private static class ConnectionFinalizerPhantomReference extends PhantomReference<MysqlConnection> {
private NetworkResources networkResources;
ConnectionFinalizerPhantomReference(MysqlConnection conn, NetworkResources networkResources, ReferenceQueue<? super MysqlConnection> refQueue) {
super(conn, refQueue);
this.networkResources = networkResources;
}
void finalizeResources() {
if (this.networkResources != null) {
try {
// 关闭其中的网络连接资源
this.networkResources.forceClose();
} finally {
this.networkResources = null;
}
}
}
}
看到这里大家应该明白,是jdbc为每个connection都生成了一个ConnectionFinalizerPhantomReference,目的是为了当connection对象回收时,顺便回收相关资源。这其实是一个保底操作,是怕connnection资源被上层的连接池或者使用者忘记close,从而导致资源泄漏。
问题解决方式
现在知道了为什么ConnectionFinalizerPhantomReference会产生,下面分析下为什么会有这么多的ConnectionFinalizerPhantomReference,我们使用的是hikari线程池,hikari连接池参数配置介绍见https://www.jianshu.com/p/eb85103f29d6。
我们现有的配置如下:
- 最小连接数没设置,默认会和最大连接数一样,是100
- maxLifeTime用的是默认值,30min,30分钟后连接会被回收
也就是说每30分钟都会产生100个无效的连接,一天会产生4800个ConnectionFinalizerPhantomReference。
这个问题优化的方式有两种:
调整数据库连接池配置
现在连接数100个过多,默认的10个就够用了,这样连接数的产生就会小10倍,然后可以调整maxLifeTime,数据库默认的连接空闲时间是8小时,maxLifeTime可以调整为6个小时,这样每天只会产生40个连接,这个ConnectionFinalizerPhantomReference过多的问题也就能解决了。
删除connectionFinalizerPhantomRefs集合
数据库连接的释放有连接池保证,这个保底机制其实是多余的,可以起个定时任务线程,去清理这个connectionFinalizerPhantomRefs集合。
下面是一个旧版本的清理示例代码。